
Author: Yu-Hsuan Cheng (鄭又瑄)
Thesis Title: Design and Implementation of an FPGA-Based CNN Accelerator with Row-Stationary Data Flow
Advisor: Ming-Bo Lin (林銘波)
Committee Members: Yie-Tarng Chen (陳郁堂), Shu-Yen Lin (林書彥), Cheng-Hung Tsai (蔡政鴻)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2020
Academic Year of Graduation: 108
Language: Chinese
Pages: 56
Keywords: Convolutional neural network, parallelism, accelerator, FPGA, handwritten-digit recognition, deep learning
    In recent years, convolutional neural networks (CNNs) have been applied successfully in everyday life, achieving excellent results in computer-vision tasks such as object tracking, face recognition, and image classification. As CNNs have matured, many accelerators have been proposed. However, implementing them with the limited resources of embedded systems remains a major challenge. Moreover, because CNN models are complex, their computations must access a large number of parameters, so accessing parameters and data often consumes more power than the computation itself. Reducing the number of parameter and data movements is therefore a key factor in lowering power consumption and increasing throughput.
    To this end, this thesis implements a CNN accelerator based on the row-stationary dataflow, whose central idea is to share data locally so as to reduce the number of data accesses and movements. Accordingly, we design a dedicated processing element (PE) array that computes the required convolution sums in a pipelined, parallel manner. To improve performance, three refinements are proposed. First, a point-wise convolution layer is added to reduce the number of channels, and the fully connected layer is removed, cutting the model's overall parameter count. Second, the two-dimensional PE array is reduced in size and channels are computed in parallel, maximizing data sharing to raise throughput and lower power consumption. Third, pipelining and parallelism are applied in the accelerator circuit to increase its speed. In addition, with a small modification to the control circuit, the proposed accelerator can operate in a pipelined mode, further improving performance.
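    The row-stationary mapping described above can be sketched in NumPy. This is a simplified single-channel model, not the thesis's actual RTL; all names are illustrative. Each PE keeps one kernel row stationary and slides it along one input row, and the partial-sum rows produced by the PEs handling all kernel rows are accumulated to form each output row:

```python
import numpy as np

def pe_row_conv(filter_row, input_row):
    # One PE: slide a single filter row along a single input row,
    # producing one row of partial sums (1-D correlation, valid mode).
    k = len(filter_row)
    return np.array([np.dot(filter_row, input_row[j:j + k])
                     for j in range(len(input_row) - k + 1)])

def rs_conv2d(image, kernel):
    # Row-stationary mapping: each kernel row stays fixed in one PE;
    # the partial-sum rows from the R PEs holding kernel rows 0..R-1
    # are accumulated vertically to form one output row.
    R = kernel.shape[0]
    n_out_rows = image.shape[0] - R + 1
    return np.array([sum(pe_row_conv(kernel[r], image[i + r])
                         for r in range(R))
                     for i in range(n_out_rows)])
```

In hardware, the `pe_row_conv` calls run concurrently across the PE array rather than sequentially as in this sketch, which is what enables the reuse of each input row by several PEs.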
    The completed CNN accelerator has been implemented and verified on a Xilinx Virtex-6 FPGA (xc6vcx75t). It consumes 15,995 LUTs, 14,089 registers, and 116 DSP48E1 slices, operates at 110 MHz, runs the CNN core 4.8x faster than the software implementation, achieves a peak throughput of 6.16 GMACS, and reaches 98.15% accuracy on the MNIST test set.


    In recent years, convolutional neural networks (CNNs) have been widely used in daily-life applications; in particular, they have achieved great success in computer-vision tasks such as object detection, face recognition, and image classification. As CNNs have matured, many accelerators have been proposed over the last decade. Nevertheless, implementing a CNN within the limited resources inherent to embedded systems remains a major challenge. Because CNN models are complicated, a great number of parameters must be accessed during computation, creating a significant amount of data movement and consuming much more energy than the computation itself. Consequently, minimizing data movement, and hence the power consumed, is a vital factor in achieving high throughput and energy efficiency.
    To achieve this goal, this thesis designs and implements a CNN accelerator using the row-stationary (RS) dataflow. The rationale behind the RS dataflow is to maximally reuse data locally, reducing data movement and thereby optimizing energy efficiency. Based on this, a special processing element (PE) array is designed to perform the required convolution operations in a pipelined, parallel fashion. To improve performance, the following three refinements are proposed: First, a point-wise convolution layer is added to reduce the number of channels, and the fully connected layer is removed to reduce the number of parameters. Second, the size of the PE array is reduced and data reuse is maximized through channel-parallel computing, achieving high throughput with lower power consumption. Third, both pipelining and parallelism are used in the accelerator to speed up the circuit. In addition, with only a minor modification, the proposed architecture can be adapted into a staged pipeline structure to further improve performance.
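    The first refinement, point-wise (1x1) convolution for channel reduction, can be illustrated with a minimal NumPy sketch. The sizes and names here are hypothetical, chosen only to show the channel-shrinking effect:

```python
import numpy as np

def pointwise_conv(fmaps, weights):
    # 1x1 (point-wise) convolution: every pixel of output channel c_out is
    # a weighted sum over the C_in input channels at the same position,
    # so a (C_in, H, W) stack shrinks to (C_out, H, W) with C_out < C_in.
    return np.tensordot(weights, fmaps, axes=([1], [0]))

# Hypothetical sizes: 4 input channels reduced to 2 output channels.
fmaps = np.arange(16, dtype=float).reshape(4, 2, 2)   # (C_in, H, W)
weights = np.ones((2, 4))                             # (C_out, C_in)
reduced = pointwise_conv(fmaps, weights)              # shape (2, 2, 2)
```

Because a 1x1 kernel touches only one pixel per channel, the layer costs C_in * C_out weights instead of the much larger parameter count of a fully connected layer over the whole feature map.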
    The proposed CNN accelerator has been implemented and verified on an FPGA device (xc6vcx75t) of the Xilinx Virtex-6 family. It consumes 15,995 LUTs, 14,089 registers, and 116 DSP48E1 slices. Operating at 110 MHz, the CNN core runs 4.8x faster than the software implementation, with a peak throughput of 6.16 GMACS and 98.15% accuracy on the MNIST test set.
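    As a quick arithmetic check of the reported figures, dividing the peak throughput by the clock rate gives the number of multiply-accumulate operations the array must complete per cycle at peak:

```python
# Sanity check on the reported figures: peak throughput over clock rate
# gives the number of MACs completed per cycle when running at peak.
peak_macs_per_s = 6.16e9   # 6.16 GMACS (figure reported in the thesis)
clock_hz = 110e6           # 110 MHz operating frequency
macs_per_cycle = peak_macs_per_s / clock_hz
print(macs_per_cycle)      # 56.0
```

That is, the reported peak corresponds to 56 MAC operations retired every clock cycle by the parallel PE array.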

    Chapter 1  Introduction
      1.1 Research Motivation
      1.2 Research Direction
      1.3 Thesis Organization
    Chapter 2  Introduction to Convolutional Neural Networks
      2.1 CNN Terminology
        2.1.1 Convolutional Neural Networks
        2.1.2 Convolutional Layer
        2.1.3 Convolution Kernel
        2.1.4 Bias
        2.1.5 Stride
        2.1.6 Activation Function
        2.1.7 Pooling Layer
        2.1.8 Fully Connected Layer
      2.2 LeNet
      2.3 MobileNet
      2.4 MNIST Dataset
    Chapter 3  CNN Hardware Architecture Analysis and Design
      3.1 Model Used in This Thesis
        3.1.1 Model Construction
        3.1.2 Model Architecture
      3.2 Dataflow Analysis
      3.3 Systolic Arrays
      3.4 Computation Matrix Analysis
    Chapter 4  CNN Accelerator Design and Implementation
      4.1 CNN Accelerator Architecture
        4.1.1 Convolution Module
        4.1.2 Max-Pooling Module
        4.1.3 ReLU Module
        4.1.4 Point-wise Convolution Module
        4.1.5 Convolution Adder
        4.1.6 One-hot Encoder
        4.1.7 Main Controller Module
    Chapter 5  FPGA Design and Implementation
      5.1 FPGA Design and Implementation Flow
      5.2 Testing and Verification
        5.2.1 Software Test Environment
      5.3 FPGA Simulation Results
      5.4 FPGA Hardware Resource Usage
      5.5 FPGA Performance Analysis
    Chapter 6  Conclusion
    References

