研究生: |
林于勝 Yi-sheng Lin |
---|---|
論文名稱: |
以NoC架構為基礎的稀疏矩陣乘法系統設計 The Design of a NoC-Based Sparse Matrix Multiplication System |
指導教授: |
林銘波
Ming-bo Lin |
口試委員: |
陳郁堂
Yie-tarng Chen 林昌鴻 Chang-hong Lin 白英文 Ying-wen Bai 楊兆華 none |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 電子工程系 Department of Electronic and Computer Engineering |
論文出版年: | 2012 |
畢業學年度: | 100 |
語文別: | 中文 |
論文頁數: | 70 |
中文關鍵詞: | 晶片網路 、二維網格 、蟲洞路由器 、稀疏矩陣乘法 |
外文關鍵詞: | Network-on-Chip, 2D mesh, wormhole router, sparse matrix multiplication |
相關次數: | 點閱:186 下載:10 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著深次微米製程技術的發展,單一晶片可以容納更多的矽智財。然而,傳統匯流排 (bus) 無法應付多核心系統大量傳輸的需求。本論文提出一個適用於稀疏矩陣乘法設計的晶片網路 (Network-on-Chip) 硬體架構,並利用平行運算的方式改善稀疏矩陣乘法的效能與快取失誤問題。本論文的硬體架構包含路由器、網路介面和處理單元。路由器的設計採用蟲洞 (wormhole) 的封包交換技術,透過管線化的設計使封包傳輸更有效率。封包的路徑規劃則以 XY 路由演算法實現,具有低成本與低延遲的特性。我們也針對高維度的矩陣提出矩陣映射與分割的方法,增加網路的負載平衡和運算的效率。除此之外,建立在二維網格 (2D mesh) 拓撲上的路由器可以透過參數化的設計實現各種大小的晶片網路。
我們已經在 Xilinx Virtex 5 FPGA 和 TSMC 0.18 um 元件庫上實現 2×2 、 2×4 、 4×4 、 4×8 和 8×8 的晶片網路。在 FPGA 部分,實現的硬體使用隨機產生的矩陣和科學與工程應用的矩陣進行測試,並分析網路大小、矩陣大小和稀疏度對效能的影響。另外,我們也比較晶片網路的架構對 MicroBlaze 和 Intel 處理器的效能差異,最大分別可達 40 倍和 2 倍的加速度比。在 0.18 um 元件庫的部分,工作頻率皆可達到 166MHz ,其中 4×4 的晶片網路,晶片的核心面積為 1,986.5 um×1,985.4 um ,等效閘數為 259,026 ,平均功率消耗為 417.40 mW 。
With the advance of deep-submicron technologies, a huge number of IP blocks can be integrated into a single chip. However, the bus-based architecture becomes a performance bottleneck for the multicore SoC. In this thesis, a NoC-based architecture for sparse matrix multiplication is proposed to improve the performance of parallel computation. The architecture consists of routers, network interfaces, and processing elements. Our router is efficient by using the pipeline technique and wormhole switching with the XY routing algorithm. We also present a method of mapping and partitioning for large matrices to increase the load balancing and efficiency of packet distribution. In addition, the mesh-based network is fully parameterized so that it is flexible.
Various sizes of the NoC-based architecture have been implemented, including 2×2, 2×4, 4×4, 4×8 ,and 8×8, on Xilinx Virtex 5 and TSMC 0.18 um cell library. In the FPGA implementation, the performance of our design is evaluated by a number of random and real-application matrices. Besides, the effects of network sizes, matrix sizes, and sparsity on the system performance are considered. Compared with MicroBlaze and Intel processors, our design achieves up to 40x and 2x speedup respectively. In the ASIC implementation, the core area of the 4×4 NoC architecture is 1,986.5 um×1,985.4 um, which is equivalent to 259,026 gates. The average power consumption is 417 mW at the operating frequency of 166 MHz.
[1] L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,” Computer, vol. 35, pp. 70–78, Jan. 2002.
[2] S. Vangal, J. Howard, and G. Ruhl et al., “An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS,” IEEE Journal of Solid State Circuits, vol. 43, pp.
29–41, Jan. 2008.
[3] Tilera, “TILE-GX precessor family.” http://www.tilera.com/products/processors.
[4] S. Bell, B. Edwards, and J. Amann et al., “TILE64 processor: a 64-core SoC with mesh interconnect,” Proceedings of IEEE International Solid-State Circuits Conference (ISSCC 2008), pp. 88–598, Feb. 2008.
[5] ARM, “CoreLink network interconnect for AMBA AXI.”http://www.arm.com/products/system-ip/interconnect/axi/index.php.
[6] S. Toledo, “Improving the memory system performance of sparse matrix-vector multiplication,” IBM Journal of Research and Development, vol. 41, pp. 711–725, Nov. 1997.
[7] E.-J. Im, K. Yelick, and R. Vuduc, “SPARSITY: optimization framework for sparse matrix kernels,” International Journal of High Performance Computing Applications, vol. 18, pp. 135–158, Feb. 2004.
[8] L. Zhang, Z. Fang, and M. Parker et al., “The impulse memory controller,” IEEE Transactions on Computers, vol. 50, pp. 1117–1132, Nov. 2001.
[9] G. Morris and V. Prasanna, “Sparse matrix computations on reconfigurable hardware,” Computer, vol. 40, pp. 58–64, Mar. 2007.
[10] S. Williams, L. Oliker, and V. Richard et al., “Optimization of sparse matrixvector multiplication on emerging multicore platforms,” Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–12, Mar. 2007.
[11] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.
[12] L. Zhuo and V. Prasanna, “Sparse matrix-vector multiplication on FPGAs,” Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays (FPGA '05), pp. 63–74, Feb. 2005.
[13] J. Yang, C. Chun, and N. Bagherzadeh et al., “Load balancing for data-parallel applications on network-on-chip enabled multi-processor platform,” Proceedings of the 19th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 439–446, Feb. 2011.
[14] H. G. Lee, N. Chang, and U. Y. Ogras et al., “On-chip communication architecture exploration: a quantitative evaluation of point-to-point, bus, and network-on-chip approaches,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, pp. 1–20, Aug. 2008.
[15] J. Duato, S.Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, 2003.
[16] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2005.
[17] S. Scott, D. Abts, and J. Kim et al., “The blackWidow high-radix clos network,” Proceedings of the 33rd International Symposium on Computer Architecture, pp. 16–28, Jul. 2006.
[18] E. Salminen, A. Kulmala, and T. D. Hmlinen, “Survey of network-on-chip proposals.” http://www.ocpip.org, Mar. 2008.
[19] W. J. Dally, “Virtual-channel flow control,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194–205, Mar. 1992.
[20] L. G. Valiant and G. J. Brebner, “Universal schemes for parallel communication,” Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC'81), pp. 263–277, 1981.
[21] T. Nesson and S. L. Johnsson, “ROMM routing on mesh and torus networks,” Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'95), pp. 275–287, Jul. 1995.
[22] C. Glass and L. Ni, “The turn model for adaptive routing,” Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 278–287, May 1992.
[23] G.-M. Chiu, “The odd-even turn model for adaptive routing,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, pp. 729–738, Jul. 2000.
[24] Z. Bai, J. Demmel, and J. Dongarra et al., Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, 2000.
[25] C. Nicopoulos, D. Park, and J. Kim et al., “ViChaR: a dynamic virtual channel regulator for network-on-chip routers,” Proceedings of Annual IEEE/ACM International Symposium on Microarchitecture, pp. 333–346, Dec. 2006.
[26] J. Hu and R. Marculescu, “DyAD - smart routing for networks-on-chip,” Proceedings of the 41st Annual Conference on Design Automation, pp. 260–263, Jul. 2004.
[27] T. Davis and Y. Hu, “The university of florida sparse matrix collection.” http://www.cise.ufl.edu/research/sparse/matrices.