
Student: Tryan Aditya Putra
Thesis Title (Chinese): 利用極限梯度提升方法構建多層級神經網路以降低於邊緣雲環境中的推理時間
Thesis Title (English): Utilizing Extreme Gradient Boosting Method for Constructing Multilevel Neural Network to Reduce the Inference Time in an Edge-Cloud Environment
Advisor: Jenq-Shiou Leu (呂政修)
Committee Member: Jiann-Liang Che
Degree: Doctor
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 113
Keywords: Machine Learning, Compression, Acceleration
Views: 201, Downloads: 4

In recent years, the Deep Neural Network (DNN) has taken the spotlight since the rise of AlexNet and the ImageNet competition. As the years went by, the need for ever larger architectures began to appear: researchers now build massive networks with a billion parameters just to gain a few percentage points of performance. As the growth of DNN architectures becomes nearly exponential, computation devices follow the same trend. At the same time, advances on the communication side enable smart applications to run directly on edge devices, hence the need for compact and efficient DNN models that can be deployed at the edge. Several techniques for compressing networks have appeared to cope with this exponential growth. Generally, model compression and acceleration techniques can be categorized into model pruning, quantization, low-rank factorization, knowledge distillation, and branchy-type network architectures. Each technique has its own advantages and disadvantages.

In this dissertation we propose a new technique that both speeds up inference and reduces the cost of deploying DNN models. The idea is to use the edge-cloud layered environment and place a different network at each layer. A lightweight machine-learning model filters which data need to be passed on to the next layer and which do not. This helps a DNN whose capability shows diminishing returns, that is, one that needs an exponentially larger network to achieve only linear accuracy growth. We compared our technique with various state-of-the-art approaches across all compression and acceleration categories, and we show its superiority on the CIFAR-10 and ImageNet datasets.
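
To make the idea concrete, the sketch below shows one way a boosting-based decider could sit between a small edge model and a large cloud model, as described above. This is a minimal illustration under stated assumptions, not the dissertation's actual implementation: the function names, the XGBoost hyperparameters, and the choice of the edge model's softmax vector as decider features are all hypothetical choices made here for clarity.

    # Minimal sketch, assuming two already-trained image classifiers: a small
    # "edge" model and a large "cloud" model, both exposed as callables that map
    # a batch of inputs to softmax probabilities. The decider is an XGBoost
    # classifier trained to predict when the edge model is wrong. All names,
    # features, and hyperparameters here are illustrative assumptions.
    import xgboost as xgb

    def train_decider(edge_probs, true_labels):
        # edge_probs:  (N, num_classes) softmax outputs of the edge model on held-out data
        # true_labels: (N,) ground-truth class indices
        edge_wrong = (edge_probs.argmax(axis=1) != true_labels).astype(int)
        decider = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
        decider.fit(edge_probs, edge_wrong)   # features: the edge softmax vector
        return decider

    def multilevel_predict(x, edge_model, cloud_model, decider, offload_threshold=0.5):
        # Classify a single input, offloading to the cloud only when necessary.
        probs = edge_model(x)                          # cheap inference at the edge
        p_wrong = decider.predict_proba(probs)[0, 1]   # estimated chance the edge is wrong
        if p_wrong < offload_threshold:
            return int(probs.argmax(axis=1)[0])        # accept the edge prediction
        return int(cloud_model(x).argmax(axis=1)[0])   # otherwise query the cloud model

The dissertation also studies entropy- and portfolio-based deciders (Section 3.1); the same two-level structure applies, and only the rule for deciding when to offload changes.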



Table of Contents

Abstract
Related Publications
Acknowledgment
Table of contents
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Fundamentals of Deep Neural Network
    1.1.1 Introduction of DNN
  1.2 DNN for Image Recognition
    1.2.1 DNN Evolution for Image Recognition
    1.2.2 DNN Compression and Optimization
  1.3 Problems and Challenges
  1.4 Motivation
  1.5 Organization
2 Literature Review
  2.1 Pruning
    2.1.1 Fully Connected Layer Pruning
    2.1.2 CNN Layer Pruning
  2.2 Quantization
    2.2.1 Quantization during Training
    2.2.2 Quantization for Inference
  2.3 Knowledge Distillation
  2.4 Low-rank Factorization
  2.5 Branchy-Type Network
  2.6 Stochastic Gradient Boosting
  2.7 Dataset
    2.7.1 CIFAR10
    2.7.2 Imagenet
3 Multilevel Neural Network
  3.1 Decider in Multilevel Neural Network
    3.1.1 Entropy
    3.1.2 Portfolio
    3.1.3 Boosting
4 System Setup
  4.1 Environment Setup
  4.2 Initial Performance
    4.2.1 Initial Accuracy on CIFAR-10
    4.2.2 Initial Accuracy on Imagenet
5 Simulation Result
  5.1 Speedup Analysis on CIFAR-10 Dataset
  5.2 Speedup Analysis on Imagenet Dataset
  5.3 Performance Comparison with Pruning Approach
    5.3.1 Entropy
    5.3.2 Portfolio
    5.3.3 Boosting
  5.4 Performance Comparison with Knowledge Distillation Approach
    5.4.1 Entropy
    5.4.2 Portfolio
  5.5 Performance Comparison with Low-Rank Method
    5.5.1 Entropy
    5.5.2 Boosting
  5.6 Performance Comparison with Branchy-Type Network
    5.6.1 Entropy
    5.6.2 Boosting
  5.7 Portfolio Performance
  5.8 Entropy vs Boosting Performance
  5.9 Trade-off Between Accuracy Drop and Speedup
6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works
References
Biography

