
Author: 洪宇薇 (Yu-Wei Hong)
Thesis Title: 以知識蒸餾實現模型壓縮之分析 (Analysis of Model Compression Using Knowledge Distillation)
Advisor: 呂政修 (Jenq-Shiou Leu)
Committee Members: 周承復 (Cheng-Fu Chou), 林敬舜 (Ching-Shun Lin), 陳郁堂 (Yie-Tarng Chen), 方文賢 (Wen-Hsien Fang)
Degree: Master
Department: Department of Electronic and Computer Engineering (電資學院 - 電子工程系)
Year of Publication: 2019
Academic Year of Graduation: 107 (ROC calendar)
Language: English
Number of Pages: 41
Keywords (Chinese): 深度學習, 模型壓縮, 知識蒸餾, 分析, 視覺化, 熱度圖
Keywords (English): Deep Learning, Model Compression, Knowledge Distillation, analysis, visualization, heatmap
Abstract (translated from Chinese): With the development of deep learning, network architectures have grown ever larger and more complex in order to serve a wide range of needs and scenarios. These complex architectures typically force users to spend substantial computing resources and memory, and their inference is often not fast enough for real-time use. Model compression is therefore a field worth studying to address these drawbacks. At the same time, users must know how to choose a compression technique, and an acceptable compression result, that fits their own requirements, so as to strike a balance between those requirements and the performance that is sacrificed.

In this thesis, given a target model to compress, we propose two compression approaches: reducing the channel depth (width-wise) and reducing the number of layers (layer-wise). After the compressed architecture is designed, knowledge distillation is applied to raise the compressed model's classification accuracy. Finally, we demonstrate how to analyze the compression results from different perspectives and offer suggestions on balancing performance against requirements. In the experiments, MobileNet_v1 can be compressed by at least 42.27% with width-wise (channel-depth) compression and by at least 32.42% with layer-wise compression. In addition, the accuracy gain from knowledge distillation is especially notable for width-wise compression (more than 4.71%).
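To make the two compression strategies concrete, the following is a minimal sketch (not the thesis code) of how a width-wise and a layer-wise compressed MobileNet-style student might be built with Keras. The `depthwise_separable_block` and `build_student` helpers, the channel counts, and the block counts are illustrative assumptions rather than the architectures used in the thesis.

```python
# A minimal sketch of width-wise vs. layer-wise compression on a MobileNet-style
# student, assuming TensorFlow/Keras. All sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

def depthwise_separable_block(x, filters, stride=1):
    """One MobileNet-style depthwise-separable convolution block."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(6.0)(x)

def build_student(width_multiplier=1.0,
                  block_filters=(64, 128, 128, 256, 256, 512),
                  num_classes=10):
    """Width-wise compression: scale every block's channel count by `width_multiplier`.
    Layer-wise compression: pass a shorter `block_filters` tuple (fewer blocks)."""
    inputs = layers.Input(shape=(32, 32, 3))            # CIFAR-10-sized input
    x = layers.Conv2D(int(32 * width_multiplier), 3, strides=2,
                      padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    for f in block_filters:
        x = depthwise_separable_block(x, int(f * width_multiplier))
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes)(x)              # logits; softmax applied in the loss
    return models.Model(inputs, outputs)

# Width-wise student: same depth, half the channels per layer.
student_width = build_student(width_multiplier=0.5)
# Layer-wise student: full channel widths, but fewer separable blocks.
student_layer = build_student(block_filters=(64, 128, 256, 512))
```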


Abstract (English): In this paper, given a model to compress, we propose two kinds of model compression: cutting the network width-wise and layer-wise. Knowledge Distillation is then deployed to recover and improve the classifiers' accuracy. Finally, we demonstrate how to analyze the compressed models from a variety of perspectives and offer several suggestions about the trade-off between performance (inference time and accuracy) and compression rate. In the experimental results, the compression rate of width-wise compression on MobileNet_v1 is at least 42.27%, whereas that of layer-wise compression is at least 32.42%. Moreover, the accuracy improvement between the procedures with and without Knowledge Distillation is especially notable for layer-wise compression (more than 4.71%).
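As an illustration of the distillation step, below is a minimal sketch of soft-label knowledge distillation in TensorFlow/Keras, assuming a trained `teacher` and a compressed `student` that both output class logits. The temperature `T`, the mixing weight `alpha`, and the `train_step` helper are illustrative assumptions, not the settings used in the thesis.

```python
# A minimal sketch of soft-label knowledge distillation; hyperparameters are illustrative.
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, T=4.0, alpha=0.7):
    # Soft targets: teacher and student distributions softened by temperature T.
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.log_softmax(student_logits / T)
    kd = -tf.reduce_sum(soft_teacher * soft_student, axis=-1)   # cross-entropy on soft labels
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)               # hard-label loss (integer labels)
    # The T^2 factor keeps the soft-label gradient scale comparable to the hard-label term.
    return alpha * (T ** 2) * kd + (1.0 - alpha) * ce

@tf.function
def train_step(x, y, teacher, student, optimizer):
    teacher_logits = teacher(x, training=False)                 # teacher is frozen
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        loss = tf.reduce_mean(distillation_loss(y, student_logits, teacher_logits))
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```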

Table of Contents:
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Research Background
  1.2 Objectives
  1.3 Research Scope
  1.4 Outline of All Chapters
2 Literature Review
3 Methodologies
  3.1 Knowledge Distillation
    3.1.1 Soft labels
    3.1.2 Procedures
  3.2 Deep Taylor Decomposition
    3.2.1 Decomposition
    3.2.2 Mechanism
4 Experiments
  4.1 Preliminary Studies
    4.1.1 MobileNet
    4.1.2 Dataset CIFAR-10
    4.1.3 Teacher model and Layer-Sequential Unit-Variance initialization
  4.2 Proposed Scheme
    4.2.1 Experimental procedures
    4.2.2 Architectures
  4.3 Experiment Setup
    4.3.1 Environmental settings
    4.3.2 Experiment settings
  4.4 Experimental Results
    4.4.1 Model size and speed
    4.4.2 Performance
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Appendix A. Tables of Model Structure
References


Full-text release date: 2024/08/22 (campus network)
Full-text release date: 2024/08/22 (off-campus network)
Full-text release date: 2024/08/22 (National Central Library: Taiwan Dissertation and Thesis System)