
Graduate Student: 巫忠達 (Zhong-Da Wu)
Thesis Title: 針對邊緣運算上卷積神經網路推導之低累加次數的移位殘差相加布氏乘法器
Accumulation-Aware Shift and Difference-Add Booth Multiplier for Convolutional Neural Networks Inference Targeting on Edge Computing
Advisor: 阮聖彰 (Shanq-Jang Ruan)
Oral Defense Committee: 蔡宗漢 (Tsung-Han Tsai), 李佩君 (Pei-Jun Lee), 沈中安 (Chung-An Shen)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 74
Chinese Keywords: 卷積神經網路 (convolutional neural networks), 卷積神經網路加速器 (CNN accelerators), 布氏乘法器 (Booth multiplier), 資料複用資料流 (data-reuse data flow)
English Keywords: convolutional neural networks, CNN accelerators, Booth multiplier, reused data flow
    Chinese Abstract (translated): In recent years, convolutional neural networks have been applied in many fields owing to their outstanding performance in extracting complex features. These network models are powerful, but they carry a large computational cost. A large body of research has therefore explored accelerator architectures and data flows to improve throughput and optimize energy efficiency. This thesis proposes a hybrid reused data flow and a modified Booth multiplier to reduce energy consumption. The evaluation uses a pre-trained VGG16 model with a batch size of three as the benchmark. The results show that, compared with prior work, the number of state toggles in the proposed Booth multiplier is reduced by a factor of 1.96, and the amounts of data accessed from DRAM and the data buffer are reduced to 92.6% and 69.6% of the prior work, respectively.


    English Abstract: In recent years, convolutional neural networks (CNNs) have been applied in many fields owing to their strong ability to extract complex features. These CNN models are powerful, but they come at the cost of high computational complexity. Consequently, many studies have investigated accelerator architectures and data flows to optimize throughput and energy efficiency. This thesis presents a hybrid reused data flow and a modified Booth multiplier to reduce energy consumption. The evaluation uses the pre-trained VGG16 model with a batch size of three as a benchmark. The results show that the proposed design reduces the number of state toggles in the Booth multiplier by 1.96 times and reduces the DRAM and global buffer accesses to 92.6% and 69.6% of the prior work, respectively.
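
    The abstract refers to a modified Booth multiplier and to reducing state toggles during accumulation. For orientation, below is a minimal Python sketch of plain radix-4 (modified) Booth recoding and the shift-and-add accumulation it drives. It is a generic textbook illustration under assumed names (booth_radix4_digits and booth_multiply are hypothetical helpers), not the shift and difference-add variant proposed in this thesis.

    # Minimal sketch of radix-4 (modified) Booth multiplication for signed
    # integers, for orientation only; NOT the thesis's shift and
    # difference-add design.
    def booth_radix4_digits(y, bits):
        """Recode multiplier y into radix-4 Booth digits in {-2, -1, 0, +1, +2}."""
        y &= (1 << bits) - 1          # two's-complement view of y
        digits, prev = [], 0          # prev holds y_{i-1}; y_{-1} = 0
        for i in range(0, bits, 2):
            triplet = (((y >> i) & 0b11) << 1) | prev   # bits y_{i+1} y_i y_{i-1}
            prev = (y >> (i + 1)) & 1
            # Standard radix-4 recoding table: d = -2*y_{i+1} + y_i + y_{i-1}
            digits.append({0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                           0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[triplet])
        return digits                 # least-significant digit first

    def booth_multiply(x, y, bits=8):
        """Multiply signed integers by accumulating shifted partial products."""
        acc = 0
        for i, d in enumerate(booth_radix4_digits(y, bits)):
            acc += (d * x) << (2 * i)   # digit d contributes x * d * 4**i
        return acc

    if __name__ == "__main__":
        for x, y in [(13, -7), (-90, 113), (127, -128)]:
            assert booth_multiply(x, y) == x * y

    Compared with plain shift-and-add multiplication, the recoding produces one partial product per two multiplier bits, which is why Booth-based MAC units are a common starting point for low-power CNN accelerators.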

    Recommendation Form
    Committee Form
    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Introduction of convolutional accelerator
      1.2 Challenges of existing works
      1.3 Contributions of this thesis
      1.4 Organization
    Chapter 2 Background
      2.1 The CNN algorithm
      2.2 The booth multiplier
    Chapter 3 Related works
      3.1 Quantization neural network
      3.2 CNN accelerators
      3.3 Optimized MAC units
    Chapter 4 Proposed data flow and analysis
      4.1 The hybrid reused data flow
      4.2 Data analysis of the input operands
    Chapter 5 Proposed architecture
      5.1 Top-level architecture
      5.2 Architecture of the PE
      5.3 The shift and difference-add booth multiplier
    Chapter 6 Evaluation
      6.1 Implementation setup
      6.2 The memory accesses
      6.3 The state toggle of the accumulation register
      6.4 Resource utilization comparison
      6.5 Throughput comparison
    Chapter 7 Conclusion
    Reference

    Full Text Release Date: 2025/08/17 (campus network)
    Full Text Release Date: 2025/08/17 (off-campus network)
    Full Text Release Date: 2025/08/17 (National Central Library: Taiwan Thesis and Dissertation System)