簡易檢索 / 詳目顯示

研究生: 鄭弘元
Hong-Yuan Jheng
論文名稱: 植基於晶片網路使用區域可程式規劃邏輯閘陣列加速稀疏矩陣向量乘法運算
FPGA Acceleration of Sparse Matrix-Vector Multiplication Based on Network-on-Chip
指導教授: 阮聖彰
Shanq-Jang Ruan
口試委員: 熊博安
Pao-Ann Hsiung
許孟超
Mon-Chau Shie
蔡坤霖
Kun-Lin Tsai
林昌鴻
Chang-Hong Lin
學位類別: 碩士
Master
系所名稱: 電資學院 - 電子工程系
Department of Electronic and Computer Engineering
論文出版年: 2011
畢業學年度: 99
語文別: 英文
論文頁數: 56
中文關鍵詞: 稀疏矩陣稀疏矩陣向量乘法可程式規劃邏輯閘陣列晶片網路遞迴演算法
外文關鍵詞: sparse matrix, Sparse Matrix-Vector Multiplication (SMVM), FPGA, Network-on-Chip (NoC), iterative algorithm
相關次數: 點閱:226下載:3
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 稀疏矩陣向量乘法運算被廣泛應用於多種科學與工程領域。但因稀疏矩陣向量乘法含有大量的浮點運算,所以將會造成絕大部份的遞迴線性系統解題器效能嚴重下降。在涉及稀疏矩陣向量乘法的各項運算中,又因其大量記憶體存取及不規則存取等特性,而難以最佳化該運算。本論文提出了一個結合晶片網路架構並實作於區域可程式規劃邏輯閘陣列的稀疏矩陣向量乘法設計概念。在傳統電路設計中,通常採用點對點連接或共享匯流排等方式來設計晶片內的資料傳輸,因在大部份的平行處理實作中,皆僅考慮規則性資料傳輸。然而當執行稀疏矩陣向量乘法運算時,資料的傳遞行為則是視目標矩陣的稀疏結構而定,且其目標矩陣通常極為稀疏且不規則,而若採用晶片網路的結構來實作,則能更有效率的處理任意稀疏結構矩陣的資料傳遞。此外本架構擁有極高的延展性,網路大小可自訂為2×2、4×4至p×p(p∈N)。在本實作中,所有數學運算皆採用IEEE-754標準單精度浮點數格式並實作於Xilinx Vritex-6開發板。實驗結果顯示,本論文所提出之植基於晶片網路架構之設計在Matrix Market評測的結果中,相較於軟體實作方法,本實作能夠達到提升2.3倍至5.6倍速度的效果。


    The Sparse Matrix-Vector Multiplication (SMVM) is a pervasive operation in many scientific and engineering applications. Moreover, SMVM is a computational intensive operation that dominates the performance in most iterative linear system solvers. There are some optimization challenges in computations involving SMVM due to its high memory access rate and irregular memory access pattern. In this thesis, a new design concept for SMVM in an FPGA by using Network-on-Chip (NoC) is presented. In traditional circuit design on-chip communications have been designed with dedicated point-to-point interconnections or shared buses. Therefore, regular data transfer is the major concern of many parallel implementations. However, when dealing with the SMVM operation, the required data transfers are usually dependent on the sparsity structure of the matrix and can be extremely irregular. Using an NoC architecture makes it possible to deal with arbitrary structure of the data transfers, i.e. with arbitrary structured sparse matrices. In addition, the size of the pipelined SMVM calculator based on NoC architecture can be customized to 2×2, 4×4, ..., p×p (p∈N) due to its high scalibility and flexibility. The implementation is done in IEEE-754 single floating-point precision on the Xilinx Virtex-6 FPGA. The experimental results show that the proposed NoC-based implementation can achieve approximate 2.3 - 5.6 speed-up over the MATLAB-based software implementation in Matrix Market benchmark applications.

    Table of Contents iv List of Tables vi List of Figures vii Abstract x 1 Introduction 1 2 Related Works 5 3 Background 10 3.1 Finite Element Method 10 3.2 Conjugate Gradient Solver 11 3.3 Page Rank Algorithm 12 4 Network-on-Chip 15 4.1 Network Interface 16 4.2 Basic Switching Techniques 17 4.2.1 Circuit Switching 18 4.2.2 Packet Switching 18 4.2.3 Virtual Cut-Through Switching 19 4.2.4 Wormhole Switching 20 5 SMVM-NoC 22 5.1 Basic Idea 22 5.2 Mapping the SMVM to NoC 23 5.3 Partition Method for High Order Matrix 27 5.4 NoC Switch Design 30 5.5 NoC Routing Algorithm 32 5.6 NoC Processing Element Design 33 6 Experimental Results 35 6.1 Sparsity and Nonzero Entries Influence 36 6.2 Pipelined versus Nonpipelined Design 38 6.3 Impact of Varying PE Pipeline Depths 39 6.4 Synchronous versus Asynchronous Design 40 6.5 Performance Benchmarking 43 7 Conclusion 51 Bibliography 53

    [1] Y. El-Kurdi, D. Fern’andez, E. Souleimanov, D. Giannacopoulos, and W. J. Gross, “FPGA Architecture and Implementation of Sparse Matrix-Vector Multiplication for the Finite Element Method,” Computer Physics Communications, vol. 178, pp. 558–570, Apr. 2008.
    [2] V. Prasanna and G. Morris, “Sparse Matrix Computations on Reconfigurable Hardware,” Computer, vol. 40, no. 3, pp. 58–64, Mar. 2007.
    [3] A. Langville, C. Meyer, and P. Fern’andez, “Google’s Pagerank and Beyond: The Science of Search Engine Rankings,” The Mathematical Intelligencer, vol. 30, pp. 68–69, 2008.
    [4] V. Taylor, A. Ranade, and D. Messerschmitt, “SPAR: A New Architecture for Large Finite Element Computations,” IEEE Transactions on Computers, vol. 44, no. 4, pp. 531–545, Apr. 1995.
    [5] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee, “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, 2002, p. 26.
    [6] F. V’azquez, J. Fern’andez, and E. Garz’on, “A New Approach for Sparse Matrix Vector Product on NVIDIA GPUs,” Concurrency Computation Practice and Experience, vol. 23, no. 8, pp. 815–826, 2011.
    [7] Z. Wang, X. Xu, W. Zhao, Y. Zhang, and S. He, “Optimizing Sparse Matrix-Vector Multiplication on CUDA,” in 2nd International Conference on Education Technology and Computer (ICETC), vol. 4, Jun. 2010, pp. V4109–V4113.
    [8] S. Georgescu and H. Okuda, “Conjugate Gradients on Multiple GPUs,” International Journal for Numerical Methods in Fluids, vol. 64, no. 10-12, pp. 1254–1273, 2010.
    [9] A. El Zein and A. Rendell, “From Sparse Matrix to Optimal GPU CUDA Sparse Matrix Vector Product Implementation,” in 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid), May 2010, pp. 808–813.
    [10] N. Kapre and A. DeHon, “Optimistic Parallelization of Floating-Point Accumulation,” in IEEE Symposium on Computer Arithmetic, Jun. 2007, pp. 205–216.
    [11] L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,”Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.
    [12] D. Bertozzi and L. Benini, “Xpipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip,” IEEE Circuits and Systems Magazine, vol. 4, no. 2, pp. 18–31, Sep. 2004.
    [13] Y. Censor, D. Gordon, and R. Gordon, “Component Averaging: An Efficient Iterative Parallel Algorithm for Large and Sparse Unstructured Problems,” Parallel Computing, vol. 27, no. 6, pp. 777–808, 2001.
    [14] D.-H. Li, Y.-Y. Nie, J.-P. Zeng, and Q.-N. Li, “Conjugate Gradient Method for the Linear Complementarity Problem with S-Matrix,” Mathematical and Computer Modelling, vol. 48, pp. 918–928, Sep. 2008.
    [15] S. McGettrick, D. Geraghty, and C. McElroy, “An FPGA Architecture for the PageRank Eigenvector Problem,” in International Conference on Field Programmable Logic and Applications, Sep. 2008, pp. 523–526.
    [16] A. Kohler, G. Schley, and M. Radetzki, “Fault Tolerant Network on Chip Switching With Graceful Performance Degradation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 6, pp. 883–896, Jun. 2010.
    [17] F. Vitullo, N. L’Insalata, E. Petri, S. Saponara, L. Fanucci, M. Casula, R. Locatelli, and M. Coppola, “Low-Complexity Link Microarchitecture for Mesochronous Communication in Networks-on-Chip,” IEEE Transactions on Computers, vol. 57, no. 9, pp. 1196–1201, Sep. 2008.
    [18] E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander, “Trade-offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks on Chip,” in Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 350–355.
    [19] K. S. Sainarayanan, C. Raghunandan, and M. Srinivas, “Delay and Power Minimization in VLSI Interconnects with Spatio-Temporal Bus-Encoding Scheme,”in IEEE Computer Society Annual Symposium on VLSI, Mar. 2007, pp. 401–408.
    [20] S. Pasricha and N. Dutt, On-Chip Communication Architectures: System on Chip Interconnect. Morgan Kaufmann, 2008.
    [21] A. B. Kahng, “Scaling: More than Moore’s law,” IEEE Design Test of Computers, vol. 27, no. 3, pp. 86–87, May 2010.
    [22] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. Pande, C. Grecu, and A. Ivanov, “System-on-Chip: Reuse and Integration,” Proceedings of the IEEE, vol. 94, no. 6, pp. 1050–1069, june 2006.
    [23] K. Lee, S.-J. Lee, and H.-J. Yoo, “Low-Power Network-on-Chip for High-Performance SoC Design,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 2, pp. 148–160, Feb. 2006.
    [24] W. Wolf, “The Future of Multiprocessor Systems-on-Chips,” in Design Automation Conference, Jul. 2004, pp. 681–685.
    [25] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oeberg, M. Millberg, and D. Lindquist, “Network on a Chip: An Architecture for Billion Transistor Era,”in Proceeding of the IEEE NorChip Conference, Nov. 2000, pp. 24–31.
    [26] J. Hu and R. Marculescu, “DyAD - Smart Routing for Networks-on-Chip,” in Design Automation Conference, Jun. 2004, pp. 260–263.
    [27] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. Yousif, and C. Das, “A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks,” in International Symposium on Computer Architecture, 2006, pp. 4–15.
    [28] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2003.
    [29] Xilinx, “http://www.xilinx.com.”
    [30] Matrix Market, “http://math.nist.gov/MatrixMarket/,” National Institute of Standards and Technology (NIST).

    QR CODE