以NoC架構為基礎的稀疏矩陣乘法系統設計｜國立臺灣科技大學博碩士論文系統

簡易檢索 / 詳目顯示

回結果列表

研究生：	林于勝 Yi-sheng Lin
論文名稱：	以NoC架構為基礎的稀疏矩陣乘法系統設計 The Design of a NoC-Based Sparse Matrix Multiplication System
指導教授：	林銘波 Ming-bo Lin
口試委員:	陳郁堂 Yie-tarng Chen 林昌鴻 Chang-hong Lin 白英文 Ying-wen Bai 楊兆華 none
學位類別：	碩士 Master
系所名稱：	電資學院 - 電子工程系 Department of Electronic and Computer Engineering
論文出版年：	2012
畢業學年度：	100
語文別：	中文
論文頁數：	70
中文關鍵詞：	晶片網路、二維網格、蟲洞路由器、稀疏矩陣乘法
外文關鍵詞：	Network-on-Chip, 2D mesh, wormhole router, sparse matrix multiplication
相關次數：	點閱：186 下載：10
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

隨著深次微米製程技術的發展，單一晶片可以容納更多的矽智財。然而，傳統匯流排 (bus) 無法應付多核心系統大量傳輸的需求。本論文提出一個適用於稀疏矩陣乘法設計的晶片網路 (Network-on-Chip) 硬體架構，並利用平行運算的方式改善稀疏矩陣乘法的效能與快取失誤問題。本論文的硬體架構包含路由器、網路介面和處理單元。路由器的設計採用蟲洞 (wormhole) 的封包交換技術，透過管線化的設計使封包傳輸更有效率。封包的路徑規劃則以 XY 路由演算法實現，具有低成本與低延遲的特性。我們也針對高維度的矩陣提出矩陣映射與分割的方法，增加網路的負載平衡和運算的效率。除此之外，建立在二維網格 (2D mesh) 拓撲上的路由器可以透過參數化的設計實現各種大小的晶片網路。
我們已經在 Xilinx Virtex 5 FPGA 和 TSMC 0.18 um 元件庫上實現 2×2 、 2×4 、 4×4 、 4×8 和 8×8 的晶片網路。在 FPGA 部分，實現的硬體使用隨機產生的矩陣和科學與工程應用的矩陣進行測試，並分析網路大小、矩陣大小和稀疏度對效能的影響。另外，我們也比較晶片網路的架構對 MicroBlaze 和 Intel 處理器的效能差異，最大分別可達 40 倍和 2 倍的加速度比。在 0.18 um 元件庫的部分，工作頻率皆可達到 166MHz ，其中 4×4 的晶片網路，晶片的核心面積為 1,986.5 um×1,985.4 um ，等效閘數為 259,026 ，平均功率消耗為 417.40 mW 。

With the advance of deep-submicron technologies, a huge number of IP blocks can be integrated into a single chip. However, the bus-based architecture becomes a performance bottleneck for the multicore SoC. In this thesis, a NoC-based architecture for sparse matrix multiplication is proposed to improve the performance of parallel computation. The architecture consists of routers, network interfaces, and processing elements. Our router is efficient by using the pipeline technique and wormhole switching with the XY routing algorithm. We also present a method of mapping and partitioning for large matrices to increase the load balancing and efficiency of packet distribution. In addition, the mesh-based network is fully parameterized so that it is flexible.
Various sizes of the NoC-based architecture have been implemented, including 2×2, 2×4, 4×4, 4×8 ,and 8×8, on Xilinx Virtex 5 and TSMC 0.18 um cell library. In the FPGA implementation, the performance of our design is evaluated by a number of random and real-application matrices. Besides, the effects of network sizes, matrix sizes, and sparsity on the system performance are considered. Compared with MicroBlaze and Intel processors, our design achieves up to 40x and 2x speedup respectively. In the ASIC implementation, the core area of the 4×4 NoC architecture is 1,986.5 um×1,985.4 um, which is equivalent to 259,026 gates. The average power consumption is 417 mW at the operating frequency of 166 MHz.

第一章 序論 1
1 研究動機 1
2 研究方向 2
3 章節安排 3
第二章 晶片網路的背景知識 4
1 匯流排與晶片網路 4
1.1 匯流排 4
1.2 晶片網路 6
2 網路拓撲 7
3 流控方法 10
3.1 線路交換 10
3.2 封包交換 11
3.3 儲存轉送 13
3.4 虛擬直通 13
3.5 蟲洞 14
3.6 虛擬通道 15
4 路由演算法 19
4.1 死鎖問題 19
4.2 非適應性演算法 20
4.3 適應性演算法 21
第三章 矩陣乘法與傳統計算機架構 23
1 傳統計算機架構與快取失誤問題 23
2 稀疏矩陣儲存格式 25
第四章 晶片網路的設計 28
1 網路架構 28
2 路由器的設計 29
2.1 輸入單元 30
2.2 分配器 34
2.3 Crossbar 38
2.4 路由器的管線化設計 38
3 處理單元的設計 41
3.1 封包格式 41
3.2 網路介面 43
3.3 算術運算單元與儲存單元 45
3.4 矩陣的映射與分割方式 48
第五章 實驗結果與效能評估 51
1 實驗架構 51
2 效能評估 53
3 MicroBlaze與Intel處理器的效能比較 58
第六章 FPGA驗證與元件庫實現 62
1 FPGA的實體驗證 62
2 元件庫的實現 63
2.1 元件庫合成與效能分析 64
2.2 自動化佈局 64
第七章 結論 67
參考文獻 68

                                

[1] L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,” Computer, vol. 35, pp. 70–78, Jan. 2002.
[2] S. Vangal, J. Howard, and G. Ruhl et al., “An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS,” IEEE Journal of Solid State Circuits, vol. 43, pp.
29–41, Jan. 2008.
[3] Tilera, “TILE-GX precessor family.” http://www.tilera.com/products/processors.
[4] S. Bell, B. Edwards, and J. Amann et al., “TILE64 processor: a 64-core SoC with mesh interconnect,” Proceedings of IEEE International Solid-State Circuits Conference (ISSCC 2008), pp. 88–598, Feb. 2008.
[5] ARM, “CoreLink network interconnect for AMBA AXI.”http://www.arm.com/products/system-ip/interconnect/axi/index.php.
[6] S. Toledo, “Improving the memory system performance of sparse matrix-vector multiplication,” IBM Journal of Research and Development, vol. 41, pp. 711–725, Nov. 1997.
[7] E.-J. Im, K. Yelick, and R. Vuduc, “SPARSITY: optimization framework for sparse matrix kernels,” International Journal of High Performance Computing Applications, vol. 18, pp. 135–158, Feb. 2004.
[8] L. Zhang, Z. Fang, and M. Parker et al., “The impulse memory controller,” IEEE Transactions on Computers, vol. 50, pp. 1117–1132, Nov. 2001.
[9] G. Morris and V. Prasanna, “Sparse matrix computations on reconfigurable hardware,” Computer, vol. 40, pp. 58–64, Mar. 2007.
[10] S. Williams, L. Oliker, and V. Richard et al., “Optimization of sparse matrixvector multiplication on emerging multicore platforms,” Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12, Mar. 2007.
[11] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
[12] L. Zhuo and V. Prasanna, “Sparse matrix-vector multiplication on FPGAs,” Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays (FPGA '05), pp. 63–74, Feb. 2005.
[13] J. Yang, C. Chun, and N. Bagherzadeh et al., “Load balancing for data-parallel applications on network-on-chip enabled multi-processor platform,” Proceedings of the 19th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 439–446, Feb. 2011.
[14] H. G. Lee, N. Chang, and U. Y. Ogras et al., “On-chip communication architecture exploration: a quantitative evaluation of point-to-point, bus, and network-on-chip approaches,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, pp. 1–20, Aug. 2008.
[15] J. Duato, S.Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, 2003.
[16] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2005.
[17] S. Scott, D. Abts, and J. Kim et al., “The blackWidow high-radix clos network,” Proceedings of the 33rd International Symposium on Computer Architecture, pp. 16–28, Jul. 2006.
[18] E. Salminen, A. Kulmala, and T. D. Hmlinen, “Survey of network-on-chip proposals.” http://www.ocpip.org, Mar. 2008.
[19] W. J. Dally, “Virtual-channel flow control,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194–205, Mar. 1992.
[20] L. G. Valiant and G. J. Brebner, “Universal schemes for parallel communication,” Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC'81), pp. 263–277, 1981.
[21] T. Nesson and S. L. Johnsson, “ROMM routing on mesh and torus networks,” Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'95), pp. 275–287, Jul. 1995.
[22] C. Glass and L. Ni, “The turn model for adaptive routing,” Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 278–287, May 1992.
[23] G.-M. Chiu, “The odd-even turn model for adaptive routing,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, pp. 729–738, Jul. 2000.
[24] Z. Bai, J. Demmel, and J. Dongarra et al., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, 2000.
[25] C. Nicopoulos, D. Park, and J. Kim et al., “ViChaR: a dynamic virtual channel regulator for network-on-chip routers,” Proceedings of Annual IEEE/ACM International Symposium on Microarchitecture, pp. 333–346, Dec. 2006.
[26] J. Hu and R. Marculescu, “DyAD - smart routing for networks-on-chip,” Proceedings of the 41st Annual Conference on Design Automation, pp. 260–263, Jul. 2004.
[27] T. Davis and Y. Hu, “The university of florida sparse matrix collection.” http://www.cise.ufl.edu/research/sparse/matrices.

簡易檢索 / 詳目顯示

相關論文