
Graduate Student: Tomas Zamostny
Thesis Title: Exploring Parameter-Efficient Transfer Learning for Action Recognition in Lightweight Deep Networks (適用於輕量級深度網路動作辨識之高效率參數遷移學習研究)
Advisor: Yie-Tarng Chen (陳郁堂)
Committee Members: Yie-Tarng Chen (陳郁堂), Ming-Bo Lin (林銘波), Hsing-Lung Chen (陳省隆), Wen-Hsien Fang (方文賢), Jenq-Shiou Leu (呂政修)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 112 (ROC calendar)
Language: English
Number of Pages: 48
Keywords: Vision Transformers, Edge Device, Surveillance, Action Recognition, Parameter-Efficient Transfer Learning


Abstract: This thesis proposes a novel approach for deploying Human Activity Recognition (HAR) systems on edge devices, which is crucial for the Internet of Things (IoT). It addresses the challenge that transformer-based models are too large and memory-hungry to be practical for edge deployment. The solution is a model that balances high accuracy with a minimal parameter count. The model adapts the MobileViT architecture, which excels at spatial feature extraction, and incorporates spatio-temporal (ST) adapters so that the modified architecture can process video and capture temporal dynamics without retraining the entire MobileViT backbone, whose pre-trained weights are kept. This modification lets the model specialise in HAR tasks while remaining small enough for edge devices. Performance evaluations on the Kinetics-400 dataset show that the model achieves 74.94% accuracy with only 5.3 million parameters, of which 15% are updated during training. For real-time processing on a Jetson Nano, the model runs at 16.45 frames per second, uses 2.58 GB of RAM, and maintains a prediction accuracy of 71.07%.
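
To make the adapter idea above concrete, the sketch below is a minimal PyTorch illustration, not the thesis implementation: it follows the published ST-Adapter recipe (channel down-projection, a depthwise 3D convolution over the temporal axis, an up-projection, and a residual connection) and shows how every parameter except the adapters can be frozen. The module names (STAdapter, freeze_backbone_train_adapters), the bottleneck width, and the (B, T, H, W, C) feature layout are assumptions made for this example.

```python
import torch
import torch.nn as nn


class STAdapter(nn.Module):
    """Minimal spatio-temporal adapter sketch (after Pan et al.'s ST-Adapter):
    channel down-projection, depthwise 3D convolution that mixes information
    across neighbouring frames, up-projection, and a residual connection."""

    def __init__(self, channels: int, bottleneck: int = 64, kernel_t: int = 3):
        super().__init__()
        self.down = nn.Linear(channels, bottleneck)
        # Depthwise 3D conv: each bottleneck channel is filtered independently
        # over (T, H, W), which is where the temporal modelling happens.
        self.dwconv = nn.Conv3d(
            bottleneck, bottleneck,
            kernel_size=(kernel_t, 3, 3), padding=(kernel_t // 2, 1, 1),
            groups=bottleneck,
        )
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) features from the frozen image backbone.
        residual = x
        x = self.down(x)                  # (B, T, H, W, bottleneck)
        x = x.permute(0, 4, 1, 2, 3)      # (B, bottleneck, T, H, W) for Conv3d
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 4, 1)      # back to (B, T, H, W, bottleneck)
        x = self.up(self.act(x))          # (B, T, H, W, C)
        return x + residual


def freeze_backbone_train_adapters(model: nn.Module) -> None:
    """Freeze every parameter except those inside STAdapter modules, so only
    the adapters (a small fraction of the weights) receive gradient updates."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, STAdapter):
            for p in m.parameters():
                p.requires_grad = True


if __name__ == "__main__":
    adapter = STAdapter(channels=96)
    feats = torch.randn(2, 8, 14, 14, 96)   # 2 clips, 8 frames, 14x14 tokens
    print(adapter(feats).shape)              # torch.Size([2, 8, 14, 14, 96])
```

Freezing the backbone in this way is what keeps the trainable fraction small; in the thesis, only about 15% of the model's 5.3 million parameters are updated during training, while the pre-trained MobileViT weights are left untouched.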

Table of Contents:
1 Introduction 1
  1.1 Contributions 2
  1.2 Structure of the Thesis 2
2 Background 3
  2.1 Edge Computing 3
  2.2 Nvidia Jetson Nano 7
  2.3 Neural Networks 8
  2.4 Machine Learning 9
  2.5 Deep Learning 9
  2.6 A Single Neuron 10
    2.6.1 Perceptron 11
    2.6.2 Multi-layer Perceptron 12
  2.7 Deep Learning Frameworks 13
    2.7.1 TensorFlow 13
    2.7.2 Pytorch 13
  2.8 Convolutional Neural Network 13
    2.8.1 Padding and Stride 15
    2.8.2 Pooling 15
  2.9 Transformers 16
    2.9.1 Vision Transformers 19
  2.10 Computer Vision - Action Recognition 20
  2.11 Datasets 21
    2.11.1 ImageNet-1K 21
    2.11.2 Kinetics-400 21
3 Methodology 22
  3.1 Goal 22
  3.2 Search Methodology 22
  3.3 Enhancing CNN Capabilities 23
  3.4 Feature Extraction Approach 23
  3.5 Preprocessing 24
  3.6 Alternative Architecture 24
  3.7 MobileViT 25
    3.7.1 Architecture 26
    3.7.2 Lightweight 27
  3.8 Spatio-Temporal Adapter 28
  3.9 MobileNet-V2 29
  3.10 Adapting MobileViT with ST-Adapters 30
  3.11 Target Edge Device 32
  3.12 Optimization Tool for Edge Device 32
  3.13 Training Approach 33
    3.13.1 Full Fine-Tuning 33
    3.13.2 Adapter Fine-Tuning 33
4 Experimental Results 35
  4.1 Experimental Settings 35
    4.1.1 Training Approach 35
    4.1.2 Data Augmentation 35
  4.2 Implementation 36
  4.3 Kinetics-400 Dataset 37
  4.4 Evaluation Metrics 38
    4.4.1 Evaluation Protocol 39
  4.5 Visualisation of Training 40
  4.6 Results 42
    4.6.1 Comparison of Different Optimizers 43
    4.6.2 Inference on Server 44
  4.7 Deployment on Edge 45
    4.7.1 Inference 46
5 Conclusions 48
Bibliography 49

