
Graduate Student: 黃郁雯 (Yu-Wen Huang)
Thesis Title: 基於不同教師模型架構進行知識蒸餾之研究 (Study of Knowledge Distillation Based on Various Teacher Model Architectures)
Advisor: 呂政修 (Jenq-Shiou Leu)
Committee Members: 周承復 (Cheng-Fu Chou), 衛信文 (Hsin-Wen Wei), 王瑞堂 (Jui-Tang Wang)
Degree: Master
Department: Department of Electronic and Computer Engineering (電資學院 - 電子工程系)
Year of Publication: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 43
Keywords (Chinese): 深度學習 (deep learning), 模型壓縮 (model compression), 知識蒸餾 (knowledge distillation)
Keywords (English): Deep learning, Model Compression, Knowledge Distillation
    With the rapid advance of neural-network technology, and in order to meet increasingly complex, high-accuracy use cases, recently proposed neural network models have grown ever larger and more computationally demanding; models with strong recognition ability usually come with substantial hardware requirements. To allow such complex models to be deployed on hardware platforms with limited performance, a variety of model compression techniques have been developed. Among them, knowledge distillation extracts the knowledge of a complex teacher model and transfers it to a comparatively small student model, thereby improving the student model's effectiveness.
    This thesis investigates two teacher-model architectures for knowledge distillation: the poorly-trained-teacher architecture and the multi-teacher-assistant architecture. The poorly-trained-teacher architecture distills the student from a teacher that has not been fully trained; we experiment with teachers trained to different degrees and find that their distillation effect matches that of a fully trained teacher while greatly shortening the training time. We also propose the multi-teacher-assistant architecture, which combines the multi-teacher architecture with the teacher-assistant architecture; experiments show that its distillation effect is better than that of either original architecture. Finally, we combine the poorly-trained-teacher architecture with the multi-teacher-assistant architecture, so that knowledge distillation both raises accuracy and reduces the model's training time.


    With the rapid development of neural-network technology, and in order to meet increasingly complex and high-accuracy requirements, the neural network models proposed in recent years have become larger and larger, and their computational complexity has also increased. Network models with good recognition capability are often accompanied by high hardware requirements. To enable complex models to be deployed on hardware platforms with limited performance, a variety of model compression techniques have emerged. Among them, knowledge distillation extracts the knowledge of a complex teacher model and transfers it to a relatively small student model, thereby improving the effectiveness of the student model.
    In this thesis, we explore two teacher-model architectures for knowledge distillation, namely the Poorly-Trained-Teacher and the Multi-Teacher-Assistant. The Poorly-Trained-Teacher architecture extracts knowledge from an incompletely trained teacher model and transfers it to a student model; we try several teacher models trained for different numbers of epochs. The experimental results show that a Poorly-Trained-Teacher performs much the same as a fully trained teacher while greatly shortening the training time. We also propose the Multi-Teacher-Assistant architecture, which combines the multi-teacher architecture with the teacher-assistant architecture; the experimental results show that its performance is better than that of either original architecture. Finally, we merge the two proposed architectures, so that knowledge distillation improves accuracy and reduces the model training time as well.
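
    As a rough illustration of the soft-target-and-temperature mechanism that knowledge distillation relies on (the topic of the Soft Target, Temperature, and Cross Entropy sections listed in the table of contents below), the following PyTorch sketch shows a typical distillation loss. The temperature T and weighting factor alpha are illustrative assumptions, not values taken from this thesis.

```python
# A minimal sketch of a standard knowledge-distillation loss: a temperature-softened
# soft-target term plus ordinary cross entropy on the hard labels. T and alpha are
# illustrative assumptions, not the settings used in the thesis.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: match the student's temperature-softened distribution to the
    # teacher's; the T*T factor keeps its gradient scale comparable to the hard term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: cross entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

    Raising T flattens both distributions, which exposes more of the teacher's knowledge about how the non-target classes relate to one another.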
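
    The abstract does not spell out how the two proposed architectures are wired together, so the sketch below is only one plausible reading: several (possibly only partially trained) teachers jointly distill into a mid-sized assistant by averaging their logits, and the assistant then distills into the final student. The averaging step, the helper names, and the two-stage loop are assumptions made for illustration; the sketch reuses the hypothetical distillation_loss defined above.

```python
# Hedged sketch of a possible Multi-Teacher-Assistant pipeline: teachers -> assistant
# -> student. Averaging the teachers' logits is an illustrative assumption; the thesis
# may combine teacher outputs differently. Reuses distillation_loss defined above.
import torch

def average_teachers(teachers):
    # Combine several teacher models by averaging their logits on each batch.
    return lambda x: torch.stack([t(x) for t in teachers]).mean(dim=0)

def distill(student, teacher_fn, loader, optimizer, epochs=10, T=4.0, alpha=0.7):
    # Generic distillation loop: teacher_fn(x) supplies the soft-target logits,
    # whether they come from one model or from a combination of models.
    student.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            with torch.no_grad():
                teacher_logits = teacher_fn(x)
            loss = distillation_loss(student(x), teacher_logits, y, T=T, alpha=alpha)
            loss.backward()
            optimizer.step()
    return student

# Stage 1: several teachers distill into an assistant; Stage 2: the assistant distills
# into the student. `teachers`, `assistant`, `student`, `loader`, and the optimizers
# are hypothetical objects assumed to be defined elsewhere (e.g. ResNet variants of
# decreasing depth).
# assistant = distill(assistant, average_teachers(teachers), loader, opt_assistant)
# student = distill(student, lambda x: assistant(x), loader, opt_student)
```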

    Study of Knowledge Distillation Based on Various Teacher Model Architectures
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    Chapter 1  Introduction
        1.1  Research Background and Motivation
        1.2  Research Objectives
        1.3  Chapter Overview
    Chapter 2  Related Techniques
        2.1  Pruning
        2.2  Quantization
        2.3  Low-Rank Decomposition
        2.4  Knowledge Distillation
    Chapter 3  Research Methods
        3.1  Knowledge Distillation
            3.1.1  Knowledge Distillation Architecture
            3.1.2  Soft Target
            3.1.3  Temperature
            3.1.4  Cross Entropy
        3.2  Teacher Architecture Design for Knowledge Distillation
            3.2.1  Poorly-Trained Teacher
            3.2.2  Multi-Teacher-Assistant Model
    Chapter 4  Experimental Results
        4.1  Experimental Preparation
            4.1.1  Experimental Environment
            4.1.2  Dataset
            4.1.3  Model Network Architecture
        4.2  Experimental Results
            4.2.1  Poorly-Trained Teacher
            4.2.2  Multi-Teacher-Assistant Model
    Chapter 5  Conclusion
    References

