
Graduate Student: 林致遠
Zhi-Yuan Lin
Thesis Title: 基於改良卷積長短期記憶網路與 BERT 的即時人體行為辨識
Dynamical Frame Human Action Recognition by Modified Convolutional LSTM with BERT
Advisor: 蘇順豐
Shun-Feng Su
Committee Members: 陸敬互
Ching-Hu Lu
郭重顯
Chung-Hsien Kuo
黃有評
Yo-Ping Huang
姚立德
Leeh-Ter Yao
蘇順豐
Shun-Feng Su
Degree: 碩士
Master
Department: 電資學院 - 電機工程系
Department of Electrical Engineering
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 67
Chinese Keywords: 深度學習、人體行為辨識、卷積長短期記憶網路、BERT、電腦視覺
English Keywords: deep learning, human action recognition, convolutional LSTM, BERT, computer vision
Access Count: 315 views, 1 download

In this thesis, a deep learning model built from the modified Convolutional Long Short-Term Memory network (ModConvLSTM) [1] is used for human action recognition, with heatmaps generated by human pose estimation as the input images, and the effect of ModConvLSTM is examined at different network depths. Bidirectional Encoder Representations from Transformers (BERT) [2] is used in place of Global Average Pooling to remedy the weakness of two-dimensional convolutional neural networks (2D-CNNs) in handling temporal sequences; by adding masking during training together with its attention mechanism, BERT gives the model better contextual inference ability. With BERT, our network reaches 91.46% and 83.06% accuracy on the NTU-60 [3] and NTU-120 [4] datasets, improving accuracy by 1.63% and 3.22%, respectively, at the cost of only 0.1 GFLOPs of additional computation. In addition, our network can perform real-time recognition: whenever a new frame arrives, a single update of the LSTM cell and a single BERT computation are enough to obtain the current recognition result, and the model completes this real-time computation in only 14.2 ms on a CPU. This thesis provides a real-time human action recognition method that can recognize actions even when the start and end times of an action are unknown, which makes the model better suited to practical applications and offers a more reasonable solution for the field of human action recognition.
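The core architectural change described above is easy to illustrate: instead of averaging per-frame features over time (Global Average Pooling), a small BERT-style transformer attends over the frame sequence and classifies from a learned token. The following is a minimal PyTorch sketch of that idea, not the thesis code; the class name TemporalBERTPool, the layer sizes, and the use of nn.TransformerEncoder in place of the thesis's BERT module are illustrative assumptions.

import torch
import torch.nn as nn

class TemporalBERTPool(nn.Module):
    """BERT-style temporal pooling head used in place of Global Average Pooling."""
    def __init__(self, feat_dim=512, num_heads=8, num_layers=1,
                 max_frames=64, num_classes=60):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))      # learned [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats):                     # frame_feats: (B, T, C)
        b, t, _ = frame_feats.shape
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, frame_feats], dim=1)        # prepend [CLS]: (B, T+1, C)
        x = x + self.pos_embed[:, : t + 1]              # positional information
        x = self.encoder(x)                             # self-attention over time
        return self.classifier(x[:, 0])                 # classify from the [CLS] slot

# Global Average Pooling baseline, for contrast:
#   logits = nn.Linear(512, 60)(frame_feats.mean(dim=1))
frame_feats = torch.randn(2, 32, 512)                   # dummy features for 32 frames
logits = TemporalBERTPool()(frame_feats)                # -> (2, 60) class scores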


In this study, a deep architecture built on the modified Convolutional Long Short-Term Memory (ModConvLSTM) [1] is used for human action recognition, with pose-estimation heatmaps as the input features, and the effect of ModConvLSTM is verified at different network depths. We propose to replace Global Average Pooling (GAP) with BERT [2] to address the weakness of two-dimensional convolutional neural networks (2D-CNNs) in modeling temporal sequences. With masking during training and its attention mechanism, BERT gives the proposed network better contextual inference capability. With BERT, our network achieves 91.46% and 83.06% accuracy on the NTU-60 [3] and NTU-120 [4] datasets, improving accuracy by 1.63% and 3.22%, respectively, at a cost of only 0.1 GFLOPs of additional computation. In addition, our network can perform dynamical frame recognition: whenever a new frame arrives, it only needs to update the ModConvLSTM cell and run BERT once to obtain the inference result for that moment, and it completes this per-frame computation in only 14.2 ms on a CPU. This study thus provides a human action recognition approach that works frame by frame and can recognize a video without knowing the beginning and end of an action, making the model better suited to practical applications and a more reasonable solution for the human action recognition field.
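To make the "dynamical frame" behavior concrete, here is a minimal sketch of a streaming inference loop under the same assumptions: each incoming pose heatmap triggers exactly one recurrent-cell update and one pass of the temporal head, so an up-to-date prediction is always available. The ConvLSTMCell below is a generic textbook ConvLSTM, not the thesis's ModConvLSTM; the 17-channel 56x56 heatmap shape is an assumed example; and TemporalBERTPool refers to the hypothetical module sketched after the Chinese abstract above.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell (textbook version, not the thesis's ModConvLSTM)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):                        # x: (B, C_in, H, W)
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=17, hid_ch=32)    # 17 keypoint-heatmap channels (assumed)
head = TemporalBERTPool(feat_dim=32)        # hypothetical head from the earlier sketch
h = torch.zeros(1, 32, 56, 56)              # hidden and cell states start at zero
c = torch.zeros(1, 32, 56, 56)
feat_buffer = []                            # one pooled feature per frame seen so far

with torch.no_grad():
    for step in range(10):                  # stands in for a live video stream
        frame = torch.randn(1, 17, 56, 56)  # pose heatmap of the newest frame
        h, c = cell(frame, (h, c))          # exactly one recurrent update per frame
        feat_buffer.append(h.mean(dim=(2, 3)))       # spatially pooled feature, (1, 32)
        feats = torch.stack(feat_buffer, dim=1)      # (1, t, 32)
        pred = head(feats).argmax(dim=-1)   # current action estimate, available now
        print(step, pred.item())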

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Research Objective
  1.4 Thesis Contributions
  1.5 Thesis Organization
Chapter 2 Related Work
  2.1 RGB-based Action Recognition
  2.2 Skeleton-based Action Recognition
  2.3 Pose Estimation
Chapter 3 Methodology
  3.1 Human Action Recognition
    3.1.1 Modified Convolutional LSTM + 2D-CNN
    3.1.2 2D-Residual Network with BERT
    3.1.3 Convolutional Long Short-Term Memory
    3.1.4 BERT
    3.1.5 Dynamical Frame Action Recognition System
  3.2 Dataset
    3.2.1 NTU RGB+D
  3.3 Data Preprocessing
    3.3.1 Pose Estimation
    3.3.2 Subject Center Crop
    3.3.3 Keypoint Heatmap
  3.4 Data Augmentation
    3.4.1 Uniform Sampling
    3.4.2 Random Horizontal Flip
    3.4.3 Random Resized Crop
Chapter 4 Experiments
  4.1 Hardware
  4.2 Software
  4.3 Hyperparameters
    4.3.1 Batch Size
    4.3.2 Epoch
    4.3.3 Optimizer
    4.3.4 Learning Rate and Learning Rate Scheduler
    4.3.5 Video Sample Length and Size
  4.4 Training Step
  4.5 Experiment Results
    4.5.1 Training Results
    4.5.2 Confusion Matrices
    4.5.3 Network Comparisons
    4.5.4 Dynamical Frame Recognition Testing
    4.5.5 Demo of Dynamical Frame Recognition
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References

[1] L. Zhang, G. Zhu, L. Mei, P. Shen, S. A. A. Shah, and M. Bennamoun, "Attention in convolutional LSTM for gesture recognition," Advances in neural information processing systems, vol. 31, 2018.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010-1019.
[4] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684-2701, 2019.
[5] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International journal of computer vision, vol. 103, no. 1, pp. 60-79, 2013.
[6] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551-3558.
[7] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489-4497.
[8] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299-6308.
[9] K. Hara, H. Kataoka, and Y. Satoh, "Learning spatio-temporal features with 3D residual networks for action recognition," in Proceedings of the IEEE international conference on computer vision workshops, 2017, pp. 3154-3160.
[10] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450-6459.
[11] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "Slowfast networks for video recognition," in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202-6211.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
[14] L. R. Medsker and L. Jain, "Recurrent neural networks," Design and Applications, vol. 5, pp. 64-67, 2001.
[15] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," 2014.
[16] A. Vaswani et al., "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[17] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, "Revisiting skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969-2978.
[18] L. Zhang, G. Zhu, P. Shen, J. Song, S. Afaq Shah, and M. Bennamoun, "Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 3120-3128.
[19] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in neural information processing systems, vol. 28, 2015.
[20] L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, and M. J. Black, "On the integration of optical flow and action recognition," in German conference on pattern recognition, 2018: Springer, pp. 281-297.
[21] C. Feichtenhofer, "X3D: Expanding architectures for efficient video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 203-213.
[22] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, "Dynamic image networks for action recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3034-3042.
[23] X. Liu, S. L. Pintea, F. K. Nejadasl, O. Booij, and J. C. van Gemert, "No frame left behind: Full video action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14892-14901.
[24] S. Yenduri, N. Perveen, and V. Chalavadi, "Fine-grained action recognition using dynamic kernels," Pattern Recognition, vol. 122, p. 108282, 2022.
[25] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[26] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836-6846.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012-10022.
[28] Z. Liu et al., "Swin transformer v2: Scaling up capacity and resolution," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009-12019.
[29] Z. Liu et al., "Video swin transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3202-3211.
[30] Y. Han, S.-L. Chung, Q. Xiao, W. Y. Lin, and S.-F. Su, "Global spatio-temporal attention for action recognition based on 3D human skeleton data," IEEE Access, vol. 8, pp. 88604-88616, 2020.
[31] Y. Han, S.-L. Chung, A. Ambikapathi, J.-S. Chan, W.-Y. Lin, and S.-F. Su, "Robust human action recognition using global spatial-temporal attention for human skeleton data," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018: IEEE, pp. 1-8.
[32] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-second AAAI conference on artificial intelligence, 2018.
[33] S. Das, S. Sharma, R. Dai, F. Bremond, and M. Thonnat, "VPN: Learning video-pose embedding for activities of daily living," in European Conference on Computer Vision, 2020: Springer, pp. 72-90.
[34] W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, "Language supervised training for skeleton-based action recognition," arXiv preprint arXiv:2208.05318, 2022.
[35] C. Caetano, J. Sena, F. Brémond, J. A. Dos Santos, and W. R. Schwartz, "SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition," in 2019 16th IEEE international conference on advanced video and signal based surveillance (AVSS), 2019: IEEE, pp. 1-8.
[36] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, "MMTM: Multimodal transfer module for CNN fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13289-13299.
[37] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European conference on computer vision, 2016: Springer, pp. 483-499.
[38] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, "Cascaded pyramid network for multi-person pose estimation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7103-7112.
[39] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5693-5703.
[40] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7291-7299.
[41] M. Kalfaoglu, S. Kalkan, and A. A. Alatan, "Late temporal modeling in 3D CNN architectures with BERT for action recognition," in European Conference on Computer Vision, 2020: Springer, pp. 731-747.
[42] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in European conference on computer vision, 2014: Springer, pp. 740-755.
[43] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in neural information processing systems, vol. 28, 2015.
[44] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in neural information processing systems, vol. 32, 2019.
[45] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[46] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.
