
Author: 陳嘉君 (Chia-Chun Chen)
Thesis Title: 基於殘差卷積神經網路之影像分類及FPGA硬體加速器實現 (Image Classification Based on Residual Convolutional Neural Network and FPGA Hardware Accelerator Implementation)
Advisor: 楊振雄 (Cheng-Hsiung Yang)
Committee Members: 楊振雄 (Cheng-Hsiung Yang), 顏志達, 吳常熙, 李敏凡
Degree: Master
Department: Graduate Institute of Automation and Control, College of Engineering
Year of Publication: 2024
Academic Year: 112 (2023-2024)
Language: Chinese
Pages: 110
Keywords (Chinese): FPGA硬體加速器、卷積神經網路、影像分類、深度學習
Keywords (English): FPGA, Hardware Accelerator, CNN, Image Classification, Deep Learning
Access Count: 147 views, 7 downloads
Abstract:

    With the rapid development of deep learning, convolutional neural networks (CNNs) have achieved remarkable results in image classification tasks. However, traditional CNN models are difficult to run efficiently on resource-constrained embedded systems because of their large parameter counts and high computational cost, so designing lightweight CNNs is a prerequisite for such deployments. Lightweight models, however, usually suffer a drop in classification accuracy. How to build a lightweight model that retains adequate accuracy, without increasing the parameter count or resorting to complex operations, and how to run it efficiently on an embedded system, are therefore important research questions.
    This study designs a lightweight image classification model based on a residual convolutional neural network (ResNet) and explores its hardware-accelerated implementation on an FPGA (Field Programmable Gate Array). On the model side, we focus on the benefit that introducing residual blocks brings to a lightweight network. We then examine how to run inference for the improved lightweight network efficiently on the FPGA, covering the deployment of compute units, on-chip memory allocation, and data transfer strategies. To make full use of the FPGA's on-chip memory, the model's weights and intermediate feature maps are stored on chip, which lowers the frequency of external memory accesses and thereby reduces data transfer latency and power consumption. In addition, the digital signal processing (DSP) resources on the FPGA can greatly accelerate convolution; this work uses them to optimize the network's computational bottleneck layers, giving the design an advantage during inference.
    In the experiments, the feasibility of the designed hardware architecture is verified with Xilinx Vivado, and the lightweight network is deployed on a Xilinx ZCU104 FPGA for efficient image classification inference, achieving a speedup of roughly 86x over a CPU.
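    To make the residual structure concrete, the following is a minimal PyTorch sketch of the kind of residual block the thesis builds on (He et al. [7]). The channel counts, batch normalization, and 1x1 projection shortcut are illustrative assumptions only; the thesis's actual lightweight architecture is not reproduced here.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Basic residual block: y = ReLU(F(x) + shortcut(x)).

        All sizes are illustrative; this is not the thesis's model.
        """
        def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_ch)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_ch)
            # 1x1 projection so the skip path matches shape when the
            # stride or channel count changes; otherwise pass x through.
            if stride != 1 or in_ch != out_ch:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(out_ch),
                )
            else:
                self.shortcut = nn.Identity()
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + self.shortcut(x))  # the skip connection

    # Example with a CIFAR-10-sized input (the thesis cites CIFAR-10 [13]).
    block = ResidualBlock(16, 32, stride=2)
    x = torch.randn(1, 16, 32, 32)
    print(block(x).shape)  # torch.Size([1, 32, 16, 16])

    The point of the skip path is that the block only has to learn a residual F(x) on top of an identity mapping, which is what lets a lightweight network gain depth (and accuracy) without the optimization degradation of plainly stacked convolutions.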


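    The abstract's on-chip memory strategy is easiest to see at the convolution front end: a line buffer (the Line Buffer Unit of Section 3.4.1) keeps the most recent image rows on chip so that each pixel is fetched from memory only once. The following Python generator is a behavioral sketch of that idea under assumed parameters (3x3 kernel, row-major pixel stream, no padding); the name windows_3x3 and all buffer details are illustrative, not the thesis's RTL.

    from collections import deque

    def windows_3x3(stream, width):
        """Behavioral model of a 3x3 line-buffer window generator.

        Consumes a row-major pixel stream and yields 3x3 windows as three
        (top, middle, bottom) column tuples. Two line buffers hold the
        previous two image rows, so every pixel enters exactly once --
        the property a hardware line buffer exploits to avoid repeated
        external-memory reads.
        """
        row_minus2 = [None] * width   # pixels of row y-2, indexed by column
        row_minus1 = [None] * width   # pixels of row y-1, indexed by column
        col_window = deque(maxlen=3)  # last three column tuples in this row
        for i, pixel in enumerate(stream):
            x = i % width
            if x == 0:
                col_window.clear()    # a new row restarts the sliding window
            column = (row_minus2[x], row_minus1[x], pixel)
            # Shift the line buffers at this column position.
            row_minus2[x] = row_minus1[x]
            row_minus1[x] = pixel
            col_window.append(column)
            # Emit only once both buffered rows are valid (y >= 2) and the
            # window spans three columns (x >= 2).
            if len(col_window) == 3 and column[0] is not None:
                yield tuple(col_window)

    # Example on a 4x4 image whose pixel value encodes its (y, x) position:
    pixels = [y * 10 + x for y in range(4) for x in range(4)]
    for w in windows_3x3(pixels, width=4):
        print(w)  # first window: ((0, 10, 20), (1, 11, 21), (2, 12, 22))

    In the hardware, each such window would be multiplied against the kernel weights in the DSP-based Convolution Unit; feeding windows from on-chip buffers rather than re-reading external memory is one reason the abstract can claim lower data transfer latency and power.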

Table of Contents:
    Acknowledgments
    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Preface
        1.2 Literature Review
        1.3 Research Motivation
        1.4 Thesis Organization
    Chapter 2 Deep Learning Algorithms
        2.1 Deep Learning
        2.2 Convolutional Neural Network
        2.3 Convolution Layer
            2.3.1 Kernel & Stride
            2.3.2 Padding
            2.3.3 Multi-Channel Convolution
        2.4 Pooling Layer
        2.5 Activation Function
        2.6 Fully Connected Layer
        2.7 Residual Block
        2.8 Dropout Layer
        2.9 Loss Function
        2.10 Metrics
        2.11 Model Construction and Analysis
        2.12 Inference Data Conversion
    Chapter 3 Hardware Architecture Design
        3.1 FPGA Hardware Circuit Development
            3.1.1 Configurable Logic Blocks
            3.1.2 Digital Signal Processor
            3.1.3 FPGA Development Flow
        3.2 AXI Protocol
        3.3 Analysis of Convolution Hardware Architecture
        3.4 Hardware Accelerator Design
            3.4.1 Line Buffer Unit
            3.4.2 Convolution Unit
            3.4.3 Activation Unit
            3.4.4 Max Pooling Unit
            3.4.5 Fully Connected Unit
            3.4.6 Double Buffer
    Chapter 4 Experimental Results
        4.1 Hardware Circuit Simulation
            4.1.1 Line Buffer Unit Functional Simulation
            4.1.2 Convolution Unit Functional Simulation
            4.1.3 Activation Unit Functional Simulation
            4.1.4 Max Pooling Unit Functional Simulation
            4.1.5 Fully Connected Unit Functional Simulation
            4.1.6 Double Buffer Functional Simulation
        4.2 Hardware Accelerator Design Results
            4.2.1 Hardware Accelerator Functional Simulation
            4.2.2 FPGA Implementation Results
        4.3 Experimental Comparison and Analysis
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

References:

    [1] K. Kakuda, T. Enomoto, and S. Miura, "Nonlinear Activation Functions in CNN Based on Fluid Dynamics and Its Applications," Comput. Model. Eng. Sci., vol. 118, no. 1, pp. 1-14, (2019)
    [2] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June, (2017)
    [3] K. He, G. Gkioxari, P. Dollár and R. Girshick, "Mask R-CNN," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 386-397, 1 Feb, (2020)
    [4] B. Zhao, Y. Wang, H. Zhang, J. Zhang, Y. Chen and Y. Yang, "4-bit CNN Quantization Method With Compact LUT-Based Multiplier Implementation on FPGA," in IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1-10, (2023)
    [5] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, D. Erhan, Vincent Vanhoucke and Andrew Rabinovich. “Going deeper with convolutions.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1-9, (2015)
    [6] Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556, (2014)
    [7] He, Kaiming, X. Zhang, Shaoqing Ren and Jian Sun. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 770-778, (2016)
    [8] A. Huang, Z. Cao, C. Wang, J. Wen, F. Lu and L. Xu, "An FPGA-based on-chip neural network for TDLAS tomography in dynamic flames," IEEE Trans. Instrum. Meas., vol. 70, pp. 1-11, (2021)
    [9] Iandola, Forrest N., Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size.” ArXiv abs/1602.07360 (2016)
    [10] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto and Hartwig Adam. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” ArXiv abs/1704.04861 (2017)
    [11] P. Swierczynski, M. Fyrbiak, C. Paar, C. Huriaux and R. Tessier, "Protecting against Cryptographic Trojans in FPGAs," 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, Vancouver, BC, Canada, pp. 151-154, (2015)
    [12] R. Nane et al., "A Survey and Evaluation of FPGA High-Level Synthesis Tools," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591-1604, Oct, (2016)
    [13] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," University of Toronto, Tech. Rep., (2009) (the CIFAR-10 dataset)
    [14] LeCun, Y., Bengio, Y. & Hinton, G. "Deep learning." Nature 521, (2015): 436–444.
    [15] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems 25, (2012)
    [16] Sherstinsky, Alex. “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network.” ArXiv abs/1808.03314, (2018)
    [17] T. N. Sainath, O. Vinyals, A. Senior and H. Sak, "Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, pp. 4580-4584, (2015)
    [18] Li, Hongsheng et al. “Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification.” ArXiv abs/1412.4526 (2014)
    [19] Etienne Dupuis, David Novo, Ian O'Connor, Alberto Bosio, "CNN weight sharing based on a fast accuracy estimation metric," Microelectronics Reliability, Volume 122, (2021)
    [20] Zeiler, Matthew D. and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” ArXiv abs/1311.2901 (2013)
    [21] V. H. Kim and K. K. Choi, "A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA," in IEEE Access, vol. 11, pp. 59438-59445, (2023)
    [22] L. Xuan, K. -F. Un, C. -S. Lam and R. P. Martins, "An FPGA-Based Energy-Efficient Reconfigurable Depthwise Separable Convolution Accelerator for Image Recognition," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 10, pp. 4003-4007, Oct, (2022)
    [23] Ming Xia, Zunkai Huang, Li Tian, Hui Wang, Victor Chang, Yongxin Zhu, Songlin Feng, "SparkNoC: An energy-efficiency FPGA-based accelerator using optimized lightweight CNN for edge computing," Journal of Systems Architecture, Volume 115, (2021)
    [24] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, Shaojun Wei, "FP-BNN: Binarized neural network on FPGA," Neurocomputing, Volume 275, (2018)
    [25] Vinod Nair, Geoffrey E. Hinton, "Rectified linear units improve restricted boltzmann machines," In Proceedings of the 27th International Conference on International Conference on Machine Learning, (2010)
    [26] Xavier Glorot, Antoine Bordes, Yoshua Bengio, "Deep Sparse Rectifier Neural Networks," Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, (2011)
    [27] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." Proc. ICML, vol. 30, no. 1, (2013)
    [28] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," (2012)
    [29] Bishop, C. M, "Pattern Recognition and Machine Learning," Springer, (2006)
    [30] Marina Sokolova, Guy Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, Volume 45, (2009)
    [31] Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” ArXiv abs/1502.03167, (2015)
    [32] Song Han, Jeff Pool, John Tran, William Dally, "Learning both weights and connections for efficient neural network," Advances in neural information processing systems 28, (2015)
    [33] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan, "Deep Learning with Limited Numerical Precision," Proceedings of the 32nd International Conference on Machine Learning, (2015)
    [34] T. Geng et al., "O3BNN-R: An Out-of-Order Architecture for High-Performance and Regularized BNN Inference," in IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 1, pp. 199-213, 1 Jan, (2021)
    [35] Y. Kim, Q. Tong, K. Choi, E. Lee, S. Jang, and B. Choi, ‘‘System level power reduction for YOLO2 sub-modules for object detection of future autonomous vehicles,’’ in Proc. Int. SoC Design Conf. (ISOCC), pp. 151–155, Nov, (2018)
    [36] V. H. Kim and K. K. Choi, "A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA," in IEEE Access, vol. 11, pp. 59438-59445, (2023)
    [37] T. Geng et al., “O3BNN-R: An out-of-order architecture for high performance and regularized BNN inference,” IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 1, pp. 199–213, Jan, (2021)
    [38] Rouhani, Bita Darvish, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Dusan Stosic, Venmugil Elango, Maximilian Golub, Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick, Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey, Paulius Micikevicius, Maxim Naumov, Colin Verilli, Ralph Wittig, Doug Burger and Eric S. Chung. “Microscaling Data Formats for Deep Learning.” ArXiv abs/2310.10537 (2023)
    [39] Zynq® UltraScale+™ MPSoC Data Sheet: Overview (DS891)
    [40] Xilinx UltraScale Architecture CLB User Guide
