Graduate student: | Tomas Zamostny |
---|---|
Thesis title: | 適用於輕量級深度網路動作辨識之高效率參數遷移學習研究 (Exploring Parameter-Efficient Transfer Learning for Action Recognition in Lightweight Deep Networks) |
Advisor: | 陳郁堂 (Yie-Tarng Chen) |
Oral defense committee: | 陳郁堂 (Yie-Tarng Chen), 林銘波 (Ming-Bo Lin), 陳省隆 (Hsing-Lung Chen), 方文賢 (Wen-Hsien Fang), 呂政修 (Jenq-Shiou Leu) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering |
Year of publication: | 2023 |
Academic year of graduation: | 112 |
Language: | English |
Number of pages: | 48 |
Keywords (English): | Vision Transformers, Edge Device, Surveillance, Action Recognition, Parameter-Efficient Transfer Learning |
This thesis proposes a novel approach for deploying Human Activity Recognition (HAR) systems on edge devices, a capability that matters for Internet of Things (IoT) applications. Transformer-based models are typically too large and memory-hungry for edge deployment, so the goal is a model that balances high accuracy against a minimal parameter count. The model builds on the MobileViT architecture, which excels at spatial feature extraction. ST adapters are inserted into MobileViT so that the modified architecture can process video and capture temporal dynamics without retraining the entire pre-trained backbone. This modification lets the model specialise in HAR while remaining small enough for edge devices. On the Kinetics-400 dataset, the model achieves 74.94% accuracy with only 5.3 million parameters, of which 15% are updated during training. Running in real time on a Jetson Nano, the model processes 16.45 frames per second, uses 2.58 GB of RAM, and maintains a prediction accuracy of 71.07%.
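The figures in the abstract can be turned into a few derived quantities with simple arithmetic. The sketch below is only illustrative: the first two helpers use numbers quoted in the abstract, while `st_adapter_params` assumes a generic bottleneck adapter layout (linear down-projection, depthwise 3D convolution, linear up-projection). The layout, the function names, and the example width and bottleneck sizes are assumptions for illustration and are not taken from the thesis.

```python
# Figures quoted in the abstract.
TOTAL_PARAMS = 5_300_000      # total model parameters
TRAINABLE_FRACTION = 0.15     # share of parameters updated during training
JETSON_FPS = 16.45            # throughput reported on the Jetson Nano


def trainable_params(total: int, fraction: float) -> float:
    """Number of parameters updated during training."""
    return total * fraction


def frame_latency_ms(fps: float) -> float:
    """Average time budget per frame implied by a throughput figure."""
    return 1000.0 / fps


def st_adapter_params(width: int, bottleneck: int, kernel=(3, 1, 1)) -> int:
    """Parameter count of one bottleneck adapter.

    Hypothetical layout (an assumption, not the thesis's exact design):
    linear down-projection, depthwise 3D convolution over the bottleneck
    channels, linear up-projection, each with a bias term.
    """
    kt, kh, kw = kernel
    down = width * bottleneck + bottleneck            # weights + bias
    depthwise = bottleneck * kt * kh * kw + bottleneck
    up = bottleneck * width + width
    return down + depthwise + up


if __name__ == "__main__":
    print(f"trainable parameters: {trainable_params(TOTAL_PARAMS, TRAINABLE_FRACTION):,.0f}")
    print(f"per-frame latency:    {frame_latency_ms(JETSON_FPS):.2f} ms")
    # Hypothetical sizes, chosen only to show the order of magnitude.
    print(f"one adapter (width=96, bottleneck=24): {st_adapter_params(96, 24)} parameters")
```

With the abstract's numbers, roughly 0.8 million of the 5.3 million parameters are trained, and 16.45 frames per second corresponds to about 61 ms per frame. The adapter count shows why such modules stay cheap: the cost is dominated by the two projections, so it grows with the product of the width and the bottleneck size.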