Author: Raden Hadapiningsyah Kusumoseniarto (登蘇馬)
Thesis title: Two-Stream 3D Convolution Attentional Network for Action Recognition (用於動作識別的兩流3D卷積注意力網絡)
Advisor: Tien-Ruey Hsiang (項天瑞)
Committee members: Wei-Chung Teng (鄧惟中), Hsing-Kuo Pao (鮑興國)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2020
Graduation academic year: 108 (ROC calendar)
Language: English
Number of pages: 39
Keywords: 3D convolution, attention module, action recognition
Access count: 185 views, 0 downloads

We propose a new method that uses a two-stream 3D convolutional network to capture rich spatial and temporal information and then processes it with an attention module to capture long- and short-term dependencies, in order to recognize actions in videos. By taking advantage of 3D convolutions, not only is spatial information obtained, but motion across frames is also captured as temporal information. The main reason to consider long-term temporal dependencies is that they are important for identifying actions in videos. The bidirectional self-attention network uses forward/backward masks to encode temporal order information and uses attention to process the sequence of 3D convolution features. The experimental results indicate that the proposed method is comparable to state-of-the-art work on the HMDB-51 dataset with a less complex pipeline while maintaining performance.
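To make the kind of architecture the abstract describes more concrete, the following is a minimal sketch only, not the thesis's actual implementation: two small 3D-convolution streams (RGB and optical flow) each produce a temporal feature sequence, the sequences are fused, and a self-attention layer aggregates them before classification. The layer sizes, the additive fusion of the two streams, and the use of PyTorch's standard multi-head self-attention in place of the bidirectional forward/backward-masked attention described in the abstract are all assumptions made for this illustration.

```python
# Illustrative sketch (not the thesis implementation) of a two-stream
# 3D-CNN whose per-time-step features are fused and passed through a
# self-attention layer before classification.
import torch
import torch.nn as nn

class Small3DStream(nn.Module):
    """A tiny 3D-CNN stream producing a sequence of per-time-step features."""
    def __init__(self, in_channels, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool spatially, keep temporal length
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),         # collapse space, keep time axis
        )

    def forward(self, x):                               # x: (B, C, T, H, W)
        f = self.features(x)                            # (B, D, T, 1, 1)
        return f.flatten(2).transpose(1, 2)             # (B, T, D)

class TwoStreamAttentionNet(nn.Module):
    """RGB stream + optical-flow stream, fused and fed to temporal self-attention."""
    def __init__(self, num_classes, feat_dim=256, heads=4):
        super().__init__()
        self.rgb_stream = Small3DStream(in_channels=3, feat_dim=feat_dim)
        self.flow_stream = Small3DStream(in_channels=2, feat_dim=feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, flow):
        seq = self.rgb_stream(rgb) + self.flow_stream(flow)  # additive fusion (assumed), (B, T, D)
        attended, _ = self.attn(seq, seq, seq)               # self-attention over the time axis
        return self.classifier(attended.mean(dim=1))         # clip-level logits

# Example forward pass on random clips: 16 RGB frames and 16 two-channel flow fields.
model = TwoStreamAttentionNet(num_classes=51)                # HMDB-51 has 51 action classes
rgb = torch.randn(2, 3, 16, 112, 112)
flow = torch.randn(2, 2, 16, 112, 112)
print(model(rgb, flow).shape)                                # torch.Size([2, 51])
```

In the thesis, the attention stage additionally encodes temporal order via forward/backward masks; the sketch uses order-agnostic self-attention purely to keep the example short.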

Table of Contents:
Recommendation Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Action Recognition on Video
  1.2 Scope of this thesis
  1.3 Thesis Outline
2 Related Work
  2.1 Action Recognition
  2.2 Spatio Temporal Features
  2.3 3D Convolution Neural Network
  2.4 Attention Network
3 Method
  3.1 3D Convolution Feature Extraction
  3.2 Self-attention Network
  3.3 Two-Stream 3D Convolution Attentional Network
4 Experimental and Result
  4.1 Dataset
  4.2 Experimental Setup
  4.3 Ablation Studies
  4.4 Comparison with State-of-the-art Work
5 Conclusion and Future Works
Appendix A: Example images from the datasets
References


Full-text release date: 2025/08/25 (campus network)
Full text not authorized for public release (off-campus network)
Full text not authorized for public release (National Central Library: Taiwan Dissertations and Theses System)