
Graduate Student: 林泓儒 (Hung-Ju Lin)
Thesis Title: 基於深度可分離卷積運算之高效能神經網路電路設計與實現
(The Efficient VLSI Design and Implementation of Neural Networks Based on Depthwise Separable Convolution)
Advisor: 沈中安 (Chung-An Shen)
Committee Members: 郭景明 (Jing-Ming Guo), 吳晉賢 (Chin-Hsien Wu), 林昌鴻 (Chang-Hong Lin), 沈中安 (Chung-An Shen)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 36
Keywords (Chinese): 卷積神經網路, 特殊應用積體電路加速器, 深度可分離卷積, 高吞吐量, 低硬體複雜度
Keywords (English): Convolutional Neural Network (CNN), Application Specific Integrated Circuit (ASIC) Accelerator, Depthwise Separable Convolution, High Throughput, Low Complexity
摘要 (Abstract in Chinese, translated):
    This thesis designs a VLSI architecture for a neural network based on depthwise separable convolution. More specifically, according to our literature survey, it presents the first hardware accelerator circuit designed for MobileNet, a neural network built on depthwise separable convolution. In our design, in order to achieve high throughput while maintaining low hardware complexity, we propose a highly efficient data flow that greatly reduces the amount of data transferred between the circuit and the off-chip memory and allows the proposed architecture to reuse data that has already been fetched, thereby avoiding the increase in hardware complexity that excessive storage elements would cause. In addition, based on the proposed data flow, the architecture adopts a deeply pipelined design to achieve high throughput. Finally, we implemented the proposed architecture in a TSMC 90 nm process. The experimental results show that our architecture achieves a throughput of 33.514 Giga multiply-accumulate operations (Giga-MACs) with only 6340 K logic gates. Compared with the most representative state-of-the-art accelerator, Eyeriss [13], our circuit is not only 5× faster but also occupies only 70% of its area; compared with other works in the literature, our architecture likewise offers high throughput and low hardware complexity.


Abstract:
    This thesis presents the efficient VLSI architecture design and circuit implementation of a neural network based on depthwise separable convolution. To the best of our knowledge, the design proposed in this thesis is the first hardware accelerator for the inference of MobileNet, a neural network built on the depthwise separable convolution scheme. In particular, in order to achieve high throughput while maintaining low area complexity, a novel data-processing flow is proposed so that the amount of data accessed from the off-chip DRAM is significantly reduced. Furthermore, the proposed architecture enjoys a high degree of data reuse without utilizing an excessive amount of storage buffers, so the area complexity incurred by the storage elements is largely mitigated. Based on the proposed data-processing flow and data-reuse scheme, a highly pipelined architecture is designed to achieve high processing throughput. The circuit is synthesized with TSMC 90 nm technology, and the performance and area complexity are evaluated based on post-synthesis estimations. The experimental results show that the proposed architecture achieves a throughput of 33.514 Giga-MACs with a hardware complexity of 6340 KGEs, excluding the highly technology-dependent memory buffers. Compared to the state-of-the-art design, Eyeriss [13], the proposed architecture achieves a 5× enhancement in speed and an approximately 30% reduction in area complexity.
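    The efficiency argument above rests on the MAC-count reduction of depthwise separable convolution over standard convolution (covered in Sections 2.1.3-2.1.4 of the thesis). The following is a minimal NumPy sketch of that comparison, not the thesis's hardware data flow or RTL; the function name and the layer sizes (K, M, N, F) are illustrative assumptions, and the cost formulas follow the MobileNet paper [8].

```python
# Minimal NumPy sketch of depthwise separable convolution and its MAC-count
# advantage over standard convolution. Illustrative only: this is a software
# reference, not the proposed accelerator's data flow, and the layer sizes
# below (K, M, N, F) are made-up example values, not figures from the thesis.
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """x: (H, W, M) input feature map.
    dw_kernels: (K, K, M) -- one KxK filter per input channel (depthwise step).
    pw_kernels: (M, N)    -- 1x1 filters that mix the M channels into N outputs.
    Returns an (H-K+1, W-K+1, N) output (stride 1, no padding)."""
    H, W, M = x.shape
    K = dw_kernels.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Depthwise convolution: every channel is filtered independently.
    dw_out = np.zeros((Ho, Wo, M))
    for i in range(Ho):
        for j in range(Wo):
            window = x[i:i + K, j:j + K, :]               # (K, K, M) patch
            dw_out[i, j, :] = (window * dw_kernels).sum(axis=(0, 1))

    # Pointwise (1x1) convolution: a channel-mixing matrix multiply per pixel.
    return dw_out @ pw_kernels                            # (Ho, Wo, N)

# Quick functional check with small random tensors.
K, M, N = 3, 32, 64
x = np.random.rand(8, 8, M)
y = depthwise_separable_conv(x, np.random.rand(K, K, M), np.random.rand(M, N))
assert y.shape == (6, 6, N)

# MAC-count comparison (formulas from the MobileNet paper [8]) for an
# illustrative F x F output feature map.
F = 112
standard_macs  = K * K * M * N * F * F                    # standard convolution
separable_macs = K * K * M * F * F + M * N * F * F        # depthwise + pointwise
print(f"standard : {standard_macs:,} MACs")
print(f"separable: {separable_macs:,} MACs "
      f"({separable_macs / standard_macs:.1%} of standard)")
```

    For K = 3 the separable form needs roughly an eighth to a ninth of the MACs of a standard convolution; this is the kind of saving that the pipelined architecture and cross data flow described in the abstract are built to exploit.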

    Table of Contents
    摘要 (Abstract in Chinese) II
    Abstract III
    誌謝 (Acknowledgements) IV
    Table of Contents V
    Figures VII
    Tables IX
    I. Introduction of Convolutional Neural Network and the Hardware Accelerator 1
    II. Background and Literature Survey 4
      2.1 Classic CNN Model and its Convolutions 4
        2.1.1 An Overview of CNN Model 4
        2.1.2 AlexNet and Standard Convolution 5
        2.1.3 MobileNet and Depthwise Separable Convolution 7
        2.1.4 Convolution-Complexity Comparison 8
        2.1.5 Performance of CNN Model on ARM Platform 9
    III. Data Flow: Analysis and Proposed 10
      3.1 Analysis of CNN Hardware Architecture 10
        3.1.1 Hardware Model for Evaluating Latency and Power 10
        3.1.2 Weight Stationary 12
        3.1.3 Output Stationary 13
        3.1.4 Row Stationary 14
      3.2 Proposed: Cross Data Flow 15
        3.2.1 Parallel Row Stationary Data Flow for Depthwise Convolution 16
        3.2.2 Parallel Output Stationary Data Flow for Pointwise Convolution 18
    IV. Proposed Architecture for Accelerating CNN 21
      4.1 An Overview of Proposed Architecture 21
      4.2 Processing Element (PE) 22
        4.2.1 PE while Processing Depthwise Convolution 23
        4.2.2 PE while Processing Pointwise Convolution 24
      4.3 Switching Module 25
      4.4 Mapping MobileNet on Proposed Architecture 26
      4.5 Memory Fetching Module 27
    V. Experimental Results 29
    VI. Conclusion 34
    References 35

    [1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
    [2] Li, Jianan, et al. "Scale-aware fast R-CNN for pedestrian detection." IEEE Transactions on Multimedia 20.4 (2018): 985-996.
    [3] Yang, Yi, et al. "Towards real-time traffic sign detection and classification." IEEE Transactions on Intelligent Transportation Systems 17.7 (2016): 2022-2031.
    [4] Liu, Wei, et al. "SSD: Single shot multibox detector." European conference on computer vision. Springer, Cham, 2016.
    [5] Sze, Vivienne, et al. "Efficient processing of deep neural networks: A tutorial and survey." Proceedings of the IEEE 105.12 (2017): 2295-2329.
    [6] Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.
    [7] Bojarski, Mariusz, et al. "End to end learning for self-driving cars." arXiv preprint arXiv:1604.07316 (2016).
    [8] Howard, Andrew G., et al. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
    [9] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).
    [10] Zhang, Xiangyu, et al. "ShuffleNet: An extremely efficient convolutional neural network for mobile devices." arXiv preprint arXiv:1707.01083 (2017).
    [11] Moons, Bert, et al. "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI." Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 2017.
    [12] Sze, Vivienne. "Designing Hardware for Machine Learning: The Important Role Played by Circuit Designers." IEEE Solid-State Circuits Magazine 9.4 (2017): 46-54.
    [13] Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." IEEE Journal of Solid-State Circuits 52.1 (2017): 127-138.
    [14] Guo, Kaiyuan, et al. "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.1 (2018): 35-47.
    [15] NCNN: A CNN Framework for Mobile Devices from Tencent, https://github.com/Tencent/ncnn
    [16] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
    [17] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
    [18] Sifre, Laurent, and P. S. Mallat. "Rigid-motion scattering for image classification." PhD thesis, 2014.
    [19] V. Nair, G. E. Hinton, "Rectified linear units improve restricted boltzmann machines", Proc. ICML, pp. 807-814, 2010.
    [20] A. L. Maas, A. Y. Hannun, A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models", Proc. ICML, pp. 1-6, 2013.
    [21] K. He, X. Zhang, S. Ren, J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", Proc. ICCV, pp. 1026-1034, 2015.
    [22] Szegedy, Christian, et al. "Going deeper with convolutions." Proc. CVPR, 2015.
    [23] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
    [24] An Overview of ARM Architecture, https://developer.arm.com/technologies
    [25] Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks", Proc. ISCA, pp. 367-379, 2016.
    [26] Y.-H. Chen, J. Emer, V. Sze, "Using dataflow to optimize energy efficiency of deep neural network accelerators", IEEE Micro, vol. 37, no. 3, pp. 12-21, May/Jun. 2017.
    [27] S. Chakradhar, M. Sankaradas, V. Jakkula, S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks", Proc. ISCA, pp. 247-257, 2010.
    [28] V. Gokhale, J. Jin, A. Dundar, B. Martini, E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks", CVPR Workshop, pp. 682-687, 2014.
    [29] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, "Deep learning with limited numerical precision", Proc. ICML, pp. 1737-1746, 2015.
    [30] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor", Proc. ISCA, pp. 92-104, 2015.
    [31] M. Peemen, A. A. A. Setio, B. Mesman, H. Corporaal, "Memory-centric accelerator design for convolutional neural networks", Proc. ICCD, pp. 13-19, 2013.
    [32] Chen, Yu-Hsin, Joel Emer, and Vivienne Sze. "Using dataflow to optimize energy efficiency of deep neural network accelerators." IEEE Micro 37.3 (2017): 12-21.
    [33] Desoli, Giuseppe, et al. "14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems." Solid-State Circuits Conference (ISSCC), 2017 IEEE International. IEEE, 2017.
    [34] Ardakani, Arash, et al. "An Architecture to Accelerate Convolution in Deep Neural Networks." IEEE Transactions on Circuits and Systems I: Regular Papers 65.4 (2018): 1349-1362.
    [35] Malladi, Krishna T., et al. "Towards energy-proportional datacenter memory with mobile DRAM." Computer Architecture (ISCA), 2012 39th Annual International Symposium on. IEEE, 2012.

    Full text available from 2023/08/27 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan Electronic Theses and Dissertations System)