
Graduate Student: Jules SALZINGER
Thesis Title: Soft Knowledge Relocation: toward heuristic-agnostic model pruning (軟知識移轉研究：探討啟發式與模型無關之權重稀疏方法)
Advisors: Jenq-Shiou Leu (呂政修), Yie-Tarng Chen (陳郁堂)
Oral Defense Committee: Wen-Hsien Fang (方文賢), Ray-Guang Cheng (鄭瑞光)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2020
Graduation Academic Year: 108 (2019–2020)
Language: English
Number of Pages: 120
Keywords: model compression, deep learning, node pruning, information theory, heuristics, survey

The appeal of Deep Learning solutions from an industrial perspective is currently hindered by the difficulty of adapting these models to the constraints of real-world deployment. Compression algorithms have long existed for this purpose: they consist of finding a new model that is smaller than the original but retains most of its inferential power. However, for a number of reasons, they have not been sufficient to make Deep Learning accessible to small companies. In this study, our goal is to open up a new domain of research dedicated to finding heuristic-agnostic compression methods to solve this problem. We start by analysing the current literature and conclude that the quest for better heuristics, which has been the focus of most research to date, is only one part of the problem. The other part, namely reducing the influence of the heuristic on the compression algorithm, has been mostly ignored and could constitute an interesting direction for future work. We then determine a sufficient condition ensuring heuristic agnosticism in a compression algorithm. From there, we design an algorithm that makes use of this theory and test it by compressing the I3D network [1]. Even though our method is still at an early stage of development, we observe that it can lower the variance of the baseline algorithm by about 60% while also improving its performance.
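To make the abstract's notion of a pruning "heuristic" concrete, the following is a minimal, illustrative sketch in PyTorch of the simplest common heuristic, magnitude-based channel pruning. It is not the Soft Knowledge Relocation method proposed in the thesis; the function name prune_conv_channels, the keep_ratio parameter, and the example layer sizes are assumptions made purely for illustration.

# Illustrative sketch only: a simple L1-magnitude heuristic for channel pruning.
# This is NOT the thesis's Soft Knowledge Relocation method.
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep the output channels of `conv` with the largest L1 weight norm."""
    # One score per output channel: sum of |weights| over input channels and kernel.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(scores, n_keep).indices.sort().values  # surviving channel indices

    # Build a smaller layer and copy over the surviving parameters.
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

# Usage: replace a layer, then fine-tune the smaller model to recover accuracy.
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
smaller = prune_conv_channels(layer, keep_ratio=0.25)
print(layer.out_channels, "->", smaller.out_channels)  # 128 -> 32

In a full network, the next layer's input channels (and any normalisation parameters) would have to be trimmed to match, and the smaller model is typically fine-tuned afterwards. The thesis's concern is precisely how strongly the final compressed model depends on the particular scoring heuristic chosen here.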


Table of Contents:
I) Introduction
II) Motivations: an overview of general model compression
III) Related works: a survey of current pruning methods
IV) The proposed method: Soft Knowledge Relocation
V) Experiments
VI) Discussion and future works
VII) Conclusion

[1] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017.
[2] A. Berthelier, P. Phutane, C. Blanc, S. Duffner, T. Chateau, and C. Garcia, “Deep model compression for mobile devices: A survey,” 2019.
[3] M. Augasta and T. Kathirvalavakumar, “Pruning algorithms of neural networks — a comparative study,” Central European Journal of Computer Science, vol. 3, pp. 105–115, 2013.
[4] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[5] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” CoRR, vol. abs/1611.06440, 2016.
[6] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems 2 (D. S. Touretzky, ed.), pp. 598–605, Morgan-Kaufmann, 1990.
[7] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in Neural Information Processing Systems 5 (S. J. Hanson, J. D. Cowan, and C. L. Giles, eds.), pp. 164–171, Morgan-Kaufmann, 1993.
[8] M. C. Mozer and P. Smolensky, “Skeletonization: A technique for trimming the fat from a network via relevance assessment,” in Advances in Neural Information Processing Systems 1 (D. S. Touretzky, ed.), pp. 107–115, Morgan-Kaufmann, 1989.
[9] E. D. Karnin, “A simple procedure for pruning back-propagation trained neural networks,” IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 239–242, 1990.
[10] G. Castellano, A. M. Fanelli, and M. Pelillo, “An iterative pruning algorithm for feedforward neural networks,” IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 519–531, 1997.
[11] H. Xing and B. Hu, “Two-phase construction of multilayer perceptrons using information theory,” IEEE Transactions on Neural Networks, vol. 20, no. 4, pp. 715–721, 2009.
[12] Z. Zhang and J. Qiao, “A node pruning algorithm for feedforward neural network based on neural complexity,” in 2010 International Conference on Intelligent Control and Information Processing, pp. 406–410, 2010.
[13] A. P. Engelbrecht, “A new pruning heuristic based on variance analysis of sensitivity information,” IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1386–1399, 2001.
[14] M. G. Augasta and T. Kathirvalavakumar, “A novel pruning algorithm for optimizing feedforward neural network of classification problems,” Neural Processing Letters, vol. 34, no. 3, p. 241, 2011.
[15] B. Hanin and M. Sellke, “Approximating continuous functions by ReLU nets of minimal width,” CoRR, vol. abs/1710.11278, 2017.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[17] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen, “GPipe: Efficient training of giant neural networks using pipeline parallelism,” CoRR, vol. abs/1811.06965, 2018.
[18] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” CoRR, vol. abs/1905.11946, 2019.
[19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[21] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
[22] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[24] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. I–I, IEEE, 2001.
[25] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[26] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, pp. 91–99, 2015.
[27] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017.
[28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018.
[29] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” CoRR, vol. abs/1806.07366, 2018.
[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.
[31] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, 2015.
[32] A. Gaier and D. Ha, “Weight agnostic neural networks,” 2019. https://weightagnostic.github.io.
[33] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015.
[34] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” in ICLR (Poster), OpenReview.net, 2017.
[35] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
[36] Y. Teng and A. Choromanska, “Invertible autoencoder for domain adaptation,” Computation, vol. 7, no. 2, p. 20, 2019.
[37] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” CoRR, vol. abs/1510.00149, 2016.
[38] W. Tang, G. Hua, and L. Wang, “How to train a compact binary neural network with high accuracy?,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[39] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” CoRR, vol. abs/1802.04680, 2018.
[40] T. A. Putra and J. Leu, “Multilevel neural network for reducing expected inference time,” IEEE Access, vol. 7, pp. 174129–174138, 2019.
[41] D. Whitley, “The evolution of connectivity: Pruning neural networks using genetic algorithm,” in Proceedings of IJCNN-90, pp. 134–137, 1990.
[42] C. Yang, Z. An, C. Li, B. Diao, and Y. Xu, “Multi-objective pruning for CNNs using genetic algorithm,” Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning, pp. 299–305, 2019.
[43] Y. Hu, S. Sun, J. Li, X. Wang, and Q. Gu, “A novel channel pruning method for deep neural network compression,” CoRR, vol. abs/1805.11394, 2018.
[44] M. Hagiwara, “A simple and effective method for removal of hidden units and weights,” Neurocomputing, vol. 6, pp. 207–218, 1994.
[45] B. E. Segee and M. J. Carter, “Fault tolerance of pruned multilayer networks,” in IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, pp. 447–452, IEEE, 1991.
[46] F. Ai, “A new pruning algorithm for feedforward neural networks,” in The Fourth International Workshop on Advanced Computational Intelligence, pp. 286–289, 2011.
[47] Q. Xie, E. H. Hovy, M.-T. Luong, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” CoRR, vol. abs/1911.04252, 2019.
[48] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[49] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner, “Early visual concept learning with unsupervised deep learning,” CoRR, vol. abs/1606.05579, 2016.
[50] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun, “Disentangling factors of variation in deep representation using adversarial training,” in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), pp. 5040–5048, Curran Associates, Inc., 2016.
[51] D. Bouchacourt, R. Tomioka, and S. Nowozin, “Multi-level variational autoencoder: Learning disentangled representations from grouped observations,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[52] E. L. Denton and V. Birodkar, “Unsupervised learning of disentangled representations from video,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 4414–4423, Curran Associates, Inc., 2017.
[53] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representation learning for text style transfer,” CoRR, vol. abs/1808.04339, 2018.
[54] Q. Wang, Statistical Models for Human Motion Synthesis. Thesis, Ecole Centrale Marseille, July 2018.
[55] J. Rowley, “The wisdom hierarchy: Representations of the DIKW hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–180, 2007.
[56] S. Kullback, Information Theory and Statistics. Courier Corporation, 1997.
[57] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning for video understanding,” CoRR, vol. abs/1712.04851, 2017.
[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[59] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012.
[60] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in 2011 International Conference on Computer Vision, pp. 2556–2563, IEEE, 2011.
[61] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics human action video dataset,” CoRR, vol. abs/1705.06950, 2017.
[62] R. Piziak and P. Odell, “Full rank factorization of matrices,” Mathematics Magazine, vol. 72, no. 3, pp. 193–201, 1999.
[63] M. Banerjee and N. R. Pal, “Feature selection with SVD entropy: Some modification and extension,” Information Sciences, vol. 264, pp. 118–134, 2014.
[64] C. L. Sabharwal and B. Anjum, “An SVD-entropy and bilinearity based product ranking algorithm using heterogeneous data,” Journal of Visual Languages & Computing, vol. 41, pp. 133–141, 2017.
[65] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
[66] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.
[67] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich, “GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” CoRR, vol. abs/1711.02257, 2017.
[68] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
