
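Author: 吳東樺 (Dong-Hua Wu)
Thesis title: 集合關聯式之載入/儲存快取記憶體 (SALSC: Set-Associative Load/Store Caches)
Advisor: 黃元欣 (Yuan-Shin Hwang)
Committee members: 黃冠寰 (Gwan-Hwan Hwang), 謝仁偉 (Jen-Wei Hsieh)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Pages: 51
Keywords (Chinese): 載入儲存佇列 (load/store queue), 功率消耗 (power consumption)
Keywords (English): LSQ, Load Store Queue, Energy-efficiency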

The conventional load/store queue (LSQ) is a content-addressable memory (CAM) structure in which a dynamically scheduled processor stores all in-flight memory instructions and performs fully associative, ordering-logic searches to maintain dependences and to forward data from stores to loads. The LSQ is neither efficient, since previous studies have shown that dependence violations are infrequent, nor scalable, owing to the complexity of the CAM. This thesis presents an efficient and scalable alternative to the LSQ, called the set-associative load/store cache (SALSC), which replaces the CAM with a set-associative tag array. The change is analogous to substituting a set-associative cache for a fully associative cache, since the tag array of a fully associative cache is itself a CAM. Set-associative caches have been observed to reduce tag comparisons significantly while approximating the miss rates of fully associative caches; in the same way, an SALSC can substantially lower the search bandwidth demand without incurring noticeable performance degradation from stalls caused by set conflicts. Furthermore, an SALSC can be viewed as a set-associative cache integrated with age logic, so it is a natural and straightforward extension to treat an SALSC as an L0 cache by buffering the data of memory references in its entries. Experimental results on the SPECint2000 benchmarks show that both a 32-entry and a 128-entry 4-way SALSC significantly reduce the search bandwidth demand with no visible performance penalty, while a 128-entry L0 SALSC improves the average execution time by 0.22%.
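The search-bandwidth argument in the abstract can be illustrated with a small model. The sketch below is not the thesis's implementation: the class names, the 4-byte word indexing, and the policy of rejecting an insert on a full set are all assumptions made for illustration. It compares how many entries a store-to-load forwarding search must examine in a fully associative queue versus a set-associative structure, and shows how a full set forces a stall.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    addr: int                    # memory address of the access
    age: int                     # program-order number; smaller means older
    is_store: bool
    value: Optional[int] = None  # data carried by a store


def youngest_older_store(entries: List[Entry], addr: int, age: int) -> Optional[Entry]:
    """Age logic: pick the youngest store to `addr` that is older than `age`."""
    best = None
    for e in entries:
        if e.is_store and e.addr == addr and e.age < age:
            if best is None or e.age > best.age:
                best = e
    return best


class FullyAssociativeLSQ:
    """CAM-like model: every search compares against all occupied entries."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Entry] = []

    def insert(self, e: Entry) -> bool:
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(e)
        return True

    def forward(self, addr: int, age: int) -> Tuple[Optional[int], int]:
        """Return (forwarded value or None, number of tag comparisons)."""
        hit = youngest_older_store(self.entries, addr, age)
        return (hit.value if hit is not None else None, len(self.entries))


class SetAssociativeLSC:
    """Hypothetical SALSC-like model: a search touches one set only."""

    def __init__(self, num_entries: int, ways: int) -> None:
        assert num_entries % ways == 0
        self.ways = ways
        self.num_sets = num_entries // ways
        self.sets: List[List[Entry]] = [[] for _ in range(self.num_sets)]

    def _index(self, addr: int) -> int:
        # Assumption: 4-byte words, low-order word-address bits select the set.
        return (addr >> 2) % self.num_sets

    def insert(self, e: Entry) -> bool:
        s = self.sets[self._index(e.addr)]
        if len(s) >= self.ways:
            return False  # set conflict: the instruction would have to stall
        s.append(e)
        return True

    def forward(self, addr: int, age: int) -> Tuple[Optional[int], int]:
        s = self.sets[self._index(addr)]
        hit = youngest_older_store(s, addr, age)
        return (hit.value if hit is not None else None, len(s))


if __name__ == "__main__":
    lsq, salsc = FullyAssociativeLSQ(32), SetAssociativeLSC(32, 4)
    stores = [(0x100, 1, 11), (0x104, 2, 22), (0x100, 3, 33), (0x108, 4, 44)]
    for addr, age, value in stores:
        lsq.insert(Entry(addr, age, True, value))
        salsc.insert(Entry(addr, age, True, value))
    print(lsq.forward(0x100, 5), salsc.forward(0x100, 5))
```

Both structures forward the same value to the load, but the set-associative search compares against at most `ways` entries, whereas the fully associative search scales with the number of in-flight memory instructions. The cost is that `insert` can fail on a full set even when other sets have room, which corresponds to the set-conflict stalls the abstract mentions.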

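Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Methodology
  1.4 Thesis Organization
Chapter 2 Literature Review
  2.1 Background
  2.2 Load/Store Queue
  2.3 Fully Associative Caches
  2.4 Related Work
    2.4.1 Sequentially Allocated LSQ
    2.4.2 Address-Indexed LSQ
    2.4.3 Other Approaches
Chapter 3 Methodology
  3.1 Set-Associative Load/Store Cache
  3.2 Allocation Policy
  3.3 Operation Flow
  3.4 Age Logic
  3.5 Deadlock Avoidance
Chapter 4 Experimental Results
  4.1 Experimental Setup
  4.2 Performance Evaluation
    4.2.1 Impact on Execution
    4.2.2 Reducing Search Bandwidth Demand
    4.2.3 Set Overflow
    4.2.4 Allocation Policy
    4.2.5 Re-execution
    4.2.6 Increasing Capacity
    4.2.7 Treating the SALSC as a Small L0 Cache
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References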

[1] Lee Baugh and Craig Zilles. Decomposing the load-store queue by function for power reduction and scalability. IBM Journal of Research and Development, 50(2/3):287–297, March/May 2006.
[2] I. Park, C.L. Ooi, and T.N. Vijaykumar. Reducing design complexity of the load/store queue. In Proceedings of the 36th Annual ACM/IEEE International Symposium on Microarchitecture, pages 411–422, 2003.
[3] Simha Sethumadhavan, Rajagopalan Desikan, Doug Burger, Charles R. Moore, and Stephen W. Keckler. Scalable hardware memory disambiguation for high ILP processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 399–410, 2003.
[4] Tingting Sha, Milo M. K. Martin, and Amir Roth. Scalable store-load forwarding via store queue index prediction. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 159–170. IEEE Computer Society, 2005.
[5] George Z. Chrysos and Joel S. Emer. Memory dependence prediction using store sets. In Proceedings of the 25th International Symposium on Computer Architecture, pages 142–153, June 1998.
[6] Andreas I. Moshovos, Scott E. Breach, T.N. Vijaykumar, and Gurindar S. Sohi. Dynamic speculation and synchronization of data dependences. In Proceedings of the 24th International Symposium on Computer Architecture, pages 181–193, June 1997.
[7] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical Report WRL-2001/2, Western Research Laboratory, Compaq, August 2001.
[8] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 3rd edition, 2003.
[9] D. Nicolaescu, A. Veidenbaum, and A. Nicolau. Reducing data cache energy consumption via cached load/store queue. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design, pages 252–257, 2003.
[10] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2):59–67, 2002.
[11] R.E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24–36, 1999.
[12] Alok Garg, M. Wasiur Rashid, and Michael Huang. Slackened memory dependence enforcement: Combining opportunistic forwarding with decoupled verification. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 142–154. IEEE Computer Society, 2006.
[13] E. F. Torres, P. Ibanez, V. Vinals, and J. M. Llaberia. Store buffer design in first-level multibanked data caches. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 469–480. IEEE Computer Society, 2005.
[14] Manoj Franklin and Gurindar S. Sohi. ARB: A hardware mechanism for dynamic reordering of memory references. IEEE Transactions on Computers, 45:552–571, May 1996.
[15] Sam S. Stone, Kevin M. Woley, and Matthew I. Frank. Address-indexed memory disambiguation and store-to-load forwarding. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 171–182, 2005.
[16] Jaume Abella and Antonio González. SAMIE-LSQ: Set-associative multiple-instruction entry load/store queue. In Proceedings of the 20th International Parallel and Distributed Processing Symposium, 2006.
[17] Amir Roth. Store vulnerability window (SVW): Re-execution filtering for enhanced load optimization. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 458–468, 2005.
[18] Tingting Sha, Milo M. K. Martin, and Amir Roth. NoSQ: Store-load communication without a store queue. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), pages 285–296, 2006.
[19] Samantika Subramaniam and Gabriel H. Loh. Fire-and-forget: Load/store scheduling with no store queue at all. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 273–284, 2006.
[20] Samantika Subramaniam and Gabriel H. Loh. Store vectors for scalable memory dependence prediction and scheduling. In Proceedings of the Twelfth International Symposium on High-Performance Computer Architecture, pages 65–76, 2006.
[21] Ruke Huang, Alok Garg, and Michael Huang. Software-hardware cooperative memory disambiguation. In Proceedings of the Twelfth International Symposium on High-Performance Computer Architecture, pages 244–253, 2006.
[22] Alok Garg, Fernando Castro, Michael Huang, Daniel Chaver, Luis Piñuel, and Manuel Prieto. Substituting associative load queue with simple hash tables in out-of-order microprocessors. In Proceedings of the 2006 International Symposium on Low Power Electronics and Design, pages 268–273, 2006.
[23] Gurhan Kucuk, Dmitry Ponomarev, and Kanad Ghose. Low-complexity reorder buffer architecture. In Proceedings of the 16th International Conference on Supercomputing, pages 57–66. ACM Press, June 2002.
[24] Standard Performance Evaluation Corporation. SPEC CPU2000 v1.1, 2000.
[25] Chris Weaver. SPEC 2000 binaries. http://www.eecs.umich.edu/_chriswea/benchmarks/spec2000.html.
