
Author: 洪宇薇 (Yu-Wei Hong)
Thesis Title: 以知識蒸餾實現模型壓縮之分析 (Analysis of Model Compression Using Knowledge Distillation)
Advisor: 呂政修 (Jenq-Shiou Leu)
Committee Members: 周承復 (Cheng-Fu Chou), 林敬舜 (Ching-Shun Lin), 陳郁堂 (Yie-Tarng Chen), 方文賢 (Wen-Hsien Fang)
Degree: Master
Department: Department of Electronic and Computer Engineering (電資學院 - 電子工程系)
Year of Publication: 2019
Academic Year of Graduation: 107 (ROC calendar)
Language: English
Number of Pages: 41
Keywords (Chinese): 深度學習, 模型壓縮, 知識蒸餾, 分析, 視覺化, 熱度圖
Keywords (English): Deep Learning, Model Compression, Knowledge Distillation, analysis, visualization, heatmap
Abstract (translated from Chinese): With the development of deep learning, network architectures have grown ever larger and more complex in order to serve a wide range of needs and scenarios. These complex architectures typically force users to spend substantial computing resources and memory, and their inference is often not fast enough for real-time use. Model compression is therefore a field worth studying to address these drawbacks. At the same time, users must know how to choose a compression technique, and an acceptable compression result, that fits their own requirements, so as to strike a balance between those requirements and the performance that is sacrificed.

In this thesis, given a target model to compress, we propose two compression approaches: reducing the channel depth (width-wise) and reducing the number of layers (layer-wise). After the compressed architecture is designed, knowledge distillation is applied to raise the compressed model's classification accuracy. Finally, we demonstrate how to analyze the compression results from different perspectives and offer suggestions on balancing performance against requirements. In the experiments, MobileNet_v1 can be compressed by at least 42.27% with width-wise (channel-depth) compression and by at least 32.42% with layer-wise compression. In addition, the accuracy gain from knowledge distillation is especially notable for width-wise compression (more than 4.71%).
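To make the two compression strategies concrete, the following is a minimal sketch (not the thesis code) of how a width-wise and a layer-wise compressed MobileNet-style student might be built with Keras. The `depthwise_separable_block` and `build_student` helpers, the channel counts, and the block counts are illustrative assumptions rather than the architectures used in the thesis.

```python
# A minimal sketch of width-wise vs. layer-wise compression on a MobileNet-style
# student, assuming TensorFlow/Keras. All sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

def depthwise_separable_block(x, filters, stride=1):
    """One MobileNet-style depthwise-separable convolution block."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(6.0)(x)

def build_student(width_multiplier=1.0,
                  block_filters=(64, 128, 128, 256, 256, 512),
                  num_classes=10):
    """Width-wise compression: scale every block's channel count by `width_multiplier`.
    Layer-wise compression: pass a shorter `block_filters` tuple (fewer blocks)."""
    inputs = layers.Input(shape=(32, 32, 3))            # CIFAR-10-sized input
    x = layers.Conv2D(int(32 * width_multiplier), 3, strides=2,
                      padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    for f in block_filters:
        x = depthwise_separable_block(x, int(f * width_multiplier))
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes)(x)              # logits; softmax applied in the loss
    return models.Model(inputs, outputs)

# Width-wise student: same depth, half the channels per layer.
student_width = build_student(width_multiplier=0.5)
# Layer-wise student: full channel widths, but fewer separable blocks.
student_layer = build_student(block_filters=(64, 128, 256, 512))
```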


Abstract (English): In this paper, given a model to compress, we propose two kinds of model compression: cutting the network width-wise and layer-wise. Knowledge Distillation is then deployed to recover and improve the classifiers' accuracy. Finally, we demonstrate how to analyze the compressed models from a variety of perspectives and offer several suggestions about the trade-off between performance (inference time and accuracy) and compression rate. In the experimental results, the compression rate of width-wise compression on MobileNet_v1 is at least 42.27%, whereas that of layer-wise compression is at least 32.42%. Moreover, the accuracy improvement between the procedures with and without Knowledge Distillation is especially notable for layer-wise compression (more than 4.71%).
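As an illustration of the distillation step, below is a minimal sketch of soft-label knowledge distillation in TensorFlow/Keras, assuming a trained `teacher` and a compressed `student` that both output class logits. The temperature `T`, the mixing weight `alpha`, and the `train_step` helper are illustrative assumptions, not the settings used in the thesis.

```python
# A minimal sketch of soft-label knowledge distillation; hyperparameters are illustrative.
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, T=4.0, alpha=0.7):
    # Soft targets: teacher and student distributions softened by temperature T.
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.log_softmax(student_logits / T)
    kd = -tf.reduce_sum(soft_teacher * soft_student, axis=-1)   # cross-entropy on soft labels
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)               # hard-label loss (integer labels)
    # The T^2 factor keeps the soft-label gradient scale comparable to the hard-label term.
    return alpha * (T ** 2) * kd + (1.0 - alpha) * ce

@tf.function
def train_step(x, y, teacher, student, optimizer):
    teacher_logits = teacher(x, training=False)                 # teacher is frozen
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        loss = tf.reduce_mean(distillation_loss(y, student_logits, teacher_logits))
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```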

Table of Contents:
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Research Background
  1.2 Objectives
  1.3 Research Scope
  1.4 Outline of All Chapters
2 Literature Review
3 Methodologies
  3.1 Knowledge Distillation
    3.1.1 Soft labels
    3.1.2 Procedures
  3.2 Deep Taylor Decomposition
    3.2.1 Decomposition
    3.2.2 Mechanism
4 Experiments
  4.1 Preliminary Studies
    4.1.1 MobileNet
    4.1.2 Dataset CIFAR-10
    4.1.3 Teacher model and Layer-Sequential Unit-Variance initialization
  4.2 Proposed Scheme
    4.2.1 Experimental procedures
    4.2.2 Architectures
  4.3 Experiment Setup
    4.3.1 Environmental settings
    4.3.2 Experiment settings
  4.4 Experimental Results
    4.4.1 Model size and speed
    4.4.2 Performance
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Appendix A. Tables of Model Structure
References


Full-text release date: 2024/08/22 (campus network)
Full-text release date: 2024/08/22 (off-campus network)
Full-text release date: 2024/08/22 (National Central Library: Taiwan Dissertation and Thesis System)