
Graduate Student: 林泓儒 (Hung-Ju Lin)
Thesis Title: 基於深度可分離卷積運算之高效能神經網路電路設計與實現
(The Efficient VLSI Design and Implementation of Neural Networks Based on Depthwise Separable Convolution)
Advisor: 沈中安 (Chung-An Shen)
Committee Members: 郭景明 (Jing-Ming Guo), 吳晉賢 (Chin-Hsien Wu), 林昌鴻 (Chang-Hong Lin), 沈中安 (Chung-An Shen)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 36
Keywords (Chinese): 卷積神經網路, 特殊應用積體電路加速器, 深度可分離卷積, 高吞吐量, 低硬體複雜度
Keywords (English): Convolutional Neural Network (CNN), Application Specific Integrated Circuit (ASIC) Accelerator, Depthwise Separable Convolution, High Throughput, Low Complexity
摘要 (Abstract in Chinese, translated):
    This thesis designs a VLSI architecture for a neural network based on depthwise separable convolution. More specifically, according to our literature survey, it presents the first hardware accelerator circuit designed for MobileNet, a neural network built on depthwise separable convolution. In our design, in order to achieve high throughput while maintaining low hardware complexity, we propose a highly efficient data flow that greatly reduces the amount of data transferred between the circuit and the off-chip memory and allows the proposed architecture to reuse data that has already been fetched, thereby avoiding the increase in hardware complexity that excessive storage elements would cause. In addition, based on the proposed data flow, the architecture adopts a deeply pipelined design to achieve high throughput. Finally, we implemented the proposed architecture in a TSMC 90 nm process. The experimental results show that our architecture achieves a throughput of 33.514 Giga multiply-accumulate operations (Giga-MACs) with only 6340 K logic gates. Compared with the most representative state-of-the-art accelerator, Eyeriss [13], our circuit is not only 5× faster but also occupies only 70% of its area; compared with other works in the literature, our architecture likewise offers high throughput and low hardware complexity.


Abstract:
    This thesis presents the efficient VLSI architecture design and circuit implementation of a neural network based on depthwise separable convolution. To the best of our knowledge, the design proposed in this thesis is the first hardware accelerator for the inference of MobileNet, a neural network built on the depthwise separable convolution scheme. In particular, in order to achieve high throughput while maintaining low area complexity, a novel data-processing flow is proposed so that the amount of data accessed from the off-chip DRAM is significantly reduced. Furthermore, the proposed architecture enjoys a high degree of data reuse without utilizing an excessive amount of storage buffers, so the area complexity incurred by the storage elements is largely mitigated. Based on the proposed data-processing flow and data-reuse scheme, a highly pipelined architecture is designed to achieve high processing throughput. The circuit is synthesized with TSMC 90 nm technology, and the performance and area complexity are evaluated based on post-synthesis estimations. The experimental results show that the proposed architecture achieves a throughput of 33.514 Giga-MACs with a hardware complexity of 6340 KGEs, excluding the highly technology-dependent memory buffers. Compared to the state-of-the-art design, Eyeriss [13], the proposed architecture achieves a 5× enhancement in speed and an approximately 30% reduction in area complexity.
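    The efficiency argument above rests on the MAC-count reduction of depthwise separable convolution over standard convolution (covered in Sections 2.1.3-2.1.4 of the thesis). The following is a minimal NumPy sketch of that comparison, not the thesis's hardware data flow or RTL; the function name and the layer sizes (K, M, N, F) are illustrative assumptions, and the cost formulas follow the MobileNet paper [8].

```python
# Minimal NumPy sketch of depthwise separable convolution and its MAC-count
# advantage over standard convolution. Illustrative only: this is a software
# reference, not the proposed accelerator's data flow, and the layer sizes
# below (K, M, N, F) are made-up example values, not figures from the thesis.
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """x: (H, W, M) input feature map.
    dw_kernels: (K, K, M) -- one KxK filter per input channel (depthwise step).
    pw_kernels: (M, N)    -- 1x1 filters that mix the M channels into N outputs.
    Returns an (H-K+1, W-K+1, N) output (stride 1, no padding)."""
    H, W, M = x.shape
    K = dw_kernels.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Depthwise convolution: every channel is filtered independently.
    dw_out = np.zeros((Ho, Wo, M))
    for i in range(Ho):
        for j in range(Wo):
            window = x[i:i + K, j:j + K, :]               # (K, K, M) patch
            dw_out[i, j, :] = (window * dw_kernels).sum(axis=(0, 1))

    # Pointwise (1x1) convolution: a channel-mixing matrix multiply per pixel.
    return dw_out @ pw_kernels                            # (Ho, Wo, N)

# Quick functional check with small random tensors.
K, M, N = 3, 32, 64
x = np.random.rand(8, 8, M)
y = depthwise_separable_conv(x, np.random.rand(K, K, M), np.random.rand(M, N))
assert y.shape == (6, 6, N)

# MAC-count comparison (formulas from the MobileNet paper [8]) for an
# illustrative F x F output feature map.
F = 112
standard_macs  = K * K * M * N * F * F                    # standard convolution
separable_macs = K * K * M * F * F + M * N * F * F        # depthwise + pointwise
print(f"standard : {standard_macs:,} MACs")
print(f"separable: {separable_macs:,} MACs "
      f"({separable_macs / standard_macs:.1%} of standard)")
```

    For K = 3 the separable form needs roughly an eighth to a ninth of the MACs of a standard convolution; this is the kind of saving that the pipelined architecture and cross data flow described in the abstract are built to exploit.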

    Table of Contents
    摘要 (Abstract in Chinese) II
    Abstract III
    誌謝 (Acknowledgements) IV
    Table of Contents V
    Figures VII
    Tables IX
    I. Introduction of Convolutional Neural Network and the Hardware Accelerator 1
    II. Background and Literature Survey 4
      2.1 Classic CNN Model and its Convolutions 4
        2.1.1 An Overview of CNN Model 4
        2.1.2 AlexNet and Standard Convolution 5
        2.1.3 MobileNet and Depthwise Separable Convolution 7
        2.1.4 Convolution-Complexity Comparison 8
        2.1.5 Performance of CNN Model on ARM Platform 9
    III. Data Flow: Analysis and Proposed 10
      3.1 Analysis of CNN Hardware Architecture 10
        3.1.1 Hardware Model for Evaluating Latency and Power 10
        3.1.2 Weight Stationary 12
        3.1.3 Output Stationary 13
        3.1.4 Row Stationary 14
      3.2 Proposed: Cross Data Flow 15
        3.2.1 Parallel Row Stationary Data Flow for Depthwise Convolution 16
        3.2.2 Parallel Output Stationary Data Flow for Pointwise Convolution 18
    IV. Proposed Architecture for Accelerating CNN 21
      4.1 An Overview of Proposed Architecture 21
      4.2 Processing Element (PE) 22
        4.2.1 PE while Processing Depthwise Convolution 23
        4.2.2 PE while Processing Pointwise Convolution 24
      4.3 Switching Module 25
      4.4 Mapping MobileNet on Proposed Architecture 26
      4.5 Memory Fetching Module 27
    V. Experimental Results 29
    VI. Conclusion 34
    References 35

    [1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
    [2] Li, Jianan, et al. "Scale-aware fast R-CNN for pedestrian detection." IEEE Transactions on Multimedia 20.4 (2018): 985-996.
    [3] Yang, Yi, et al. "Towards real-time traffic sign detection and classification." IEEE Transactions on Intelligent Transportation Systems 17.7 (2016): 2022-2031.
    [4] Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on computer vision. Springer, Cham, 2016.
    [5] Sze, Vivienne, et al. "Efficient processing of deep neural networks: A tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
    [6] Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.
    [7] Bojarski, Mariusz, et al. "End to end learning for self-driving cars." arXiv preprint arXiv:1604.07316 (2016).
    [8] Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
    [9] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).
    [10] Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." arXiv preprint arXiv:1707.01083 (2017).
    [11] Moons, Bert, et al. "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI." Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 2017.
    [12] Sze, Vivienne. "Designing Hardware for Machine Learning: The Important Role Played by Circuit Designers." IEEE Solid-State Circuits Magazine 9.4 (2017): 46-54.
    [13] Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." IEEE Journal of Solid-State Circuits 52.1 (2017): 127-138.
    [14] Guo, Kaiyuan, et al. "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.1 (2018): 35-47.
    [15] NCNN: A CNN Framework for Mobile Devices from Tencent, https://github.com/Tencent/ncnn
    [16] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
    [17] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
    [18] Sifre, Laurent, and P. S. Mallat. "Rigid-motion scattering for image classification." PhD thesis, 2014.
    [19] V. Nair, G. E. Hinton, "Rectified linear units improve restricted boltzmann machines", Proc. ICML, pp. 807-814, 2010.
    [20] A. L. Maas, A. Y. Hannun, A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models", Proc. ICML, pp. 1-6, 2013.
    [21] K. He, X. Zhang, S. Ren, J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", Proc. ICCV, pp. 1026-1034, 2015.
    [22] Szegedy, Christian, et al. "Going deeper with convolutions." Proc. CVPR, 2015.
    [23] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
    [24] An Overview of ARM Architecture, https://developer.arm.com/technologies
    [25] Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks", Proc. ISCA, pp. 367-379, 2016.
    [26] Y.-H. Chen, J. Emer, V. Sze, "Using dataflow to optimize energy efficiency of deep neural network accelerators", IEEE Micro, vol. 37, no. 3, pp. 12-21, May/Jun. 2017.
    [27] S. Chakradhar, M. Sankaradas, V. Jakkula, S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks", Proc. ISCA, pp. 247-257, 2010.
    [28] V. Gokhale, J. Jin, A. Dundar, B. Martini, E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks", CVPR Workshop, pp. 682-687, 2014.
    [29] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, "Deep learning with limited numerical precision", Proc. ICML, pp. 1737-1746, 2015.
    [30] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor", Proc. ISCA, pp. 92-104, 2015.
    [31] M. Peemen, A. A. A. Setio, B. Mesman, H. Corporaal, "Memory-centric accelerator design for convolutional neural networks", Proc. ICCD, pp. 13-19, 2013.
    [32] Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. "Using dataflow to optimize energy efficiency of deep neural network accelerators." IEEE Micro 37.3 (2017): 12-21.
    [33] Desoli, Giuseppe, et al. "14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems." Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 2017.
    [34] Ardakani, Arash, et al. "An Architecture to Accelerate Convolution in Deep Neural Networks." IEEE Transactions on Circuits and Systems I: Regular Papers 65.4 (2018): 1349-1362.
    [35] Malladi, Krishna T., et al. "Towards energy-proportional datacenter memory with mobile DRAM." Computer Architecture (ISCA), 2012 39th Annual International Symposium on. IEEE, 2012.

    Full text available from 2023/08/27 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan Electronic Theses and Dissertations System)