
Graduate Student: Yu-Xiang Huang (黃煜翔)
Thesis Title: Weight-Aware and Reduced-Precision Architecture Designs for Low-Cost AI Accelerators (權重曉知與降低位元精確度之低成本AI加速器架構設計)
Advisor: Shyue-Kung Lu (呂學坤)
Oral Examination Committee: 李進福, 許鈞瓏, 洪進華, 王乃堅, Shyue-Kung Lu (呂學坤)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Academic Year of Graduation: 112
Language: Chinese
Number of Pages: 62
Chinese Keywords: Tensor Processing Unit (張量處理器), Low Cost (低成本)
Foreign Keywords: AI, TPU, Cost

Owing to the rapid development of deep neural networks (DNNs), they have been widely applied to a variety of tasks such as speech recognition, image recognition, and autonomous driving. Training and inference of these models require a large amount of computation, which drives the demand for dedicated hardware accelerators that can deliver the high-performance computing needed by deep learning applications. The Tensor Processing Unit (TPU), designed by Google to improve hardware computing efficiency, is one such accelerator [1], providing effective optimization for neural networks.
The Reduced Precision TPU (RPTPU) proposed in this thesis exploits the characteristics of the weight values to optimize the area cost of the processing elements (PEs) inside the systolic array. The proposed method reduces the precision of the weight values without significantly degrading the accuracy of the deep neural network model. Because the precision of the input data is reduced, smaller multipliers can be used for the multiplication operations, thereby lowering the hardware cost of the TPU.
This thesis proposes two architectures, the Type-1 RPTPU and the Type-2 RPTPU, whose hardware costs are 39% and 43% lower, respectively, than that of a conventional TPU. The smaller multipliers also speed up hardware computation, improving operating-frequency performance by 1.15 to 1.19 times while reducing power consumption by 18% to 27%. Compared with other related works [2, 3], the proposed designs improve operating-frequency performance by nearly 1.15 times or more and reduce power consumption by 10% to 34%; in terms of MLP model accuracy, this thesis achieves more than 3% higher accuracy at a comparable hardware cost.


Due to the rapid development of Deep Neural Networks (DNNs), they have been widely applied in various tasks such as speech recognition, image recognition, and autonomous driving. The training and inference of DNN models require extensive computation, leading to a demand for specialized hardware accelerators that can meet the high computation needs of deep learning applications. The Tensor Processing Unit (TPU) is a dedicated hardware accelerator designed by Google to improve hardware computation efficiency [1], providing optimizations for neural networks.
This thesis presents the Reduced Precision Tensor Processing Unit (RPTPU), which optimizes the area cost of the processing elements (PEs) within the systolic-array-based TPU by leveraging the characteristics of the weight values. The proposed techniques reduce the precision of the weight values without significant accuracy drops. By reducing the precision of the input data, smaller multipliers can be used for the multiplication operations, thereby lowering the hardware cost of the TPU. Two types of Reduced Precision Tensor Processing Units, namely the Type-1 RPTPU and the Type-2 RPTPU, are proposed in this thesis. Their hardware costs are reduced by 39% and 43%, respectively, compared to the traditional TPU. The smaller multipliers also accelerate hardware computation, improving performance by 1.15 to 1.19 times, while power consumption is reduced by 18% to 27%. Compared to other related works [2] and [3], this thesis achieves a performance improvement of nearly 1.15 times and power reductions of 10% and 34%, respectively. In terms of accuracy for MLP models, this thesis achieves more than 3% higher inference accuracy at similar hardware cost.
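To make the core idea concrete, the following Python sketch shows how truncating 8-bit weights to a smaller number of kept bits changes a matrix-multiplication result only slightly while allowing a narrower multiplier in each PE. The 8-bit/5-bit widths, the truncation rule, and the error measurement are illustrative assumptions only, not the exact weight-preprocessing scheme used in the thesis.

    import numpy as np

    def reduce_weight_precision(weights_q8, kept_bits=5):
        # Keep only the top kept_bits of an 8-bit signed weight (sign plus
        # most-significant bits); the PE multiplier then only needs to handle
        # a kept_bits x 8 product instead of a full 8 x 8 product.
        drop = 8 - kept_bits
        return (weights_q8 >> drop) << drop  # arithmetic shift zeroes the low bits

    rng = np.random.default_rng(0)
    w = rng.integers(-128, 128, size=(4, 4), dtype=np.int32)  # 8-bit weights
    x = rng.integers(-128, 128, size=(4, 1), dtype=np.int32)  # 8-bit activations

    exact   = w @ x                                        # full-precision MACs
    reduced = reduce_weight_precision(w, kept_bits=5) @ x  # reduced-precision MACs

    err = np.abs(exact - reduced) / np.maximum(np.abs(exact), 1)
    print("max relative error:", err.max())

In hardware, a weight truncated in this way can be stored and multiplied with fewer bits, which is the source of the area and power savings reported in the abstract.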

Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1  Background and Motivation
  1.2  Thesis Organization
Chapter 2  Fundamentals of Deep Neural Networks
  2.1  Principles and Architecture of Neurons and Deep Neural Networks
  2.2  Fully Connected Neural Networks
  2.3  Convolutional Neural Networks
Chapter 3  Tensor Processing Unit (TPU)
  3.1  Principles of the TPU
  3.2  Architecture of the TPU
  3.3  Operation Flow of the TPU
    3.3.1  Weight-Stationary Dataflow
    3.3.2  Output-Stationary Dataflow
  3.4  Related Work on Low-Cost TPUs
    3.4.1  Dynamic Range Unbiased Multiplier [2]
    3.4.2  Approximate-Computing-Based TPU [3]
Chapter 4  Reduced Precision TPU (RPTPU)
  4.1  Observation of the Weight Value Distribution
  4.2  Weight Value Preprocessing
  4.3  Principles of the RPTPU
  4.4  Architecture of the RPTPU
    4.4.1  Type-1 RPTPU Architecture
    4.4.2  Type-2 RPTPU Architecture
Chapter 5  Experimental Results
  5.1  Deep Learning Model Settings
  5.2  Accuracy Analysis
  5.3  Hardware Cost Analysis
  5.4  Digital Circuit Simulation of the TPU
  5.5  VLSI Implementation
Chapter 6  Conclusions and Future Work
  6.1  Conclusions
  6.2  Future Work
References

[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, N. Boden, A. Borchers, and R. Boyle, “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in Proc. 44th Annu. Int. Symp. Comput. Archit., pp. 3–4, June 2017.
[2] S. Hashemi, R. I. Bahar, and S. Reda, “DRUM: A Dynamic Range Unbiased Multiplier for Approximate Applications,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 419–420, Nov. 2015.
[3] M. E. Elbtity, P. S. Chandarana, B. Reidy, J. K. Eshraghian, and R. Zand, “APTPU: Approximate Computing Based Tensor Processing Unit,” IEEE Trans. Circuits Syst. I, vol. 69, no. 12, pp. 5135–5144, Dec. 2022.
[4] W. A. Wulf and S. A. McKee, “Hitting the Memory Wall: Implications of the Obvious,” ACM SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.
[5] M. Valle, “Analog VLSI Implementation of Artificial Neural Networks with Supervised On-Chip Learning,” Analog Integr. Circuits and Signal Process., vol. 33, pp. 263–287, Dec. 2002.
[6] M. Bouvier, A. Valentian, T. Mesquida, F. Rummens, M. Reyboz, E. Vianello, and E. Beigne, “Spiking Neural Networks Hardware Implementations and Challenges: A Survey,” ACM J. on Emerg. Technol. in Comput. Syst., vol. 15, no. 2, pp. 1–35, Apr. 2019.
[7] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A Survey of Accelerator Architectures for Deep Neural Networks,” Engineering, vol. 6, no. 3, pp. 264–274, Mar. 2020.
[8] S. Mittal, “A Survey on Optimized Implementation of Deep Learning Models on the NVIDIA Jetson Platform,” Journal of Syst. Archit., vol. 97, pp. 428–442, Aug. 2019.
[9] L. Deng, J. Li, J. T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero, “Recent Advances in Deep Learning for Speech Research at Microsoft,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 8604–8608, May 2013.
[10] P. J. Bannon and K. A. Hurd, “Accelerated Mathematical Engine,” U.S. Patent 0026078 A1, Jan. 2019.
[11] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” arXiv preprint arXiv:1604.03168, pp. 1–10, Oct. 2016.
[12] T. F. Hsieh, J. F. Li, J. S. Lai, C. Y. Lo, D. M. Kwai, and Y. F. Chou, “Refresh Power Reduction of DRAMs in DNN Systems Using Hybrid Voting and ECC Method,” in Proc. IEEE International Test Conference in Asia, pp. 41–46, Sep. 2020.
[13] H. Ramchoun, M. A. Idrissi, Y. Ghanou, and M. Ettaouil, “Multilayer Perceptron: Architecture Optimization and Training with Mixed Activation Functions,” in Proc. of the 2nd Int. Conf. on Big Data, Cloud and Applications, pp. 1–6, Mar. 2017.
[14] M. Kayed, A. Anter, and H. Mohamed, “Classification of Garments from Fashion MNIST Dataset Using CNN LeNet-5 Architecture,” International Conference on Innovative Trends in Communication and Computer Engineering, pp. 238–243, Feb. 2020.
[15] M. F. Haque, H. Y. Lim, and D. S. Kang, “Object Detection Based on VGG with ResNet Network,” in Proc. International Conference on Electronics, Information, and Communication, pp. 1–3, Jan. 2019.
[16] T. Shanthi and R. S. Sabeenian, “Modified Alexnet Architecture for Classification of Diabetic Retinopathy Images,” Computers & Electrical Engineering, pp. 56–64, June 2019.
[17] N. Brunel, V. Hakim, and M. J. E. Richardson, “Single Neuron Dynamics and Computation,” Current Opinion in Neurobiology, vol. 25, pp. 149–155, Apr. 2014.
[18] X. Yin, J. Goudriaan, E. A. Lantinga, J. Vos, and H. J. Spiertz, “A Flexible Sigmoid Function of Determinate Growth,” Ann. Botany, vol. 91, no. 3, pp. 361–371, Feb. 2003.
[19] M. M. Lau and K. H. Lim, “Review of Adaptive Activation Function in Deep Neural Network,” in Proc. IEEE-EMBS Conf. Biomed. Eng. Sci., pp. 686–690, Dec. 2018.
[20] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in Proc. 14th Int. Conf. Artif. Intell. Statist., vol. 15, pp. 315–323, Apr. 2011.
[21] A. Schwing and R. Urtasun, “Fully Connected Deep Structured Networks,” arXiv:1503.02351, pp. 1–10, Mar. 2015.
[22] C. Nebauer, “Evaluation of Convolutional Neural Networks for Visual Recognition,” IEEE Transactions on Neural Networks, vol. 9, no. 4, pp. 685–696, Jan. 1998.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., vol. 25, pp. 1097–1105, Dec. 2012.
[24] B. Graham, “Fractional Max-pooling,” arXiv preprint arXiv:1412.6071, pp. 1–10, Dec. 2014.
[25] W. S. Hua, V. Govindaraj, S. L. Fernandes, Z. Zhu, and Z. Y. Dong, “Deep Rank-Based Average Pooling Network for COVID-19 Recognition,” Computers, Materials & Continua, vol. 70, no. 2, pp. 2797–2813, Jan. 2022.
[26] S. Kaufman, P. Phothilimthana, Y. Zhou, C. Mendis, S. Roy, A. Sabne, and M. Burrows, “A Learned Performance Model for Tensor Processing Units,” Proceedings of Machine Learning and Systems 3, pp. 1–14, Mar. 2021.
[27] S. Venkataramani et al., “RaPiD: AI Accelerator for Ultra-low Precision Training and Inference,” in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit., pp. 153–166, June 2021.
[28] S. Ryu et al., “BitBlade: Energy-efficient Variable Bit-precision Hardware Accelerator for Quantized Neural Networks,” IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1924–1935, June 2022.
[29] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili, “ERIDANUS: Efficiently Running Inference of DNNs Using Systolic Arrays,” IEEE Micro, vol. 39, no. 5, pp. 46–54, Sep./Oct. 2019.
[30] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “SCALE-Sim: Systolic CNN Accelerator Simulator,” arXiv preprint arXiv:1811.02883, pp. 1–11, Oct. 2018.
[31] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” in Proc. of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic Differentiation in PyTorch,” in NIPS Workshop, pp. 1–4, Oct. 2017.
[33] L. Deng, “The MNIST Database of Handwritten Digit Images for Machine Learning Research,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 141–142, Nov. 2012.
