Author: Aji Setyoko
Thesis Title: PAIRWISE ADJACENCY MATRIX ON ST-GCN FOR SKELETON-BASED TWO-PERSON INTERACTION RECOGNITION (運用配對相鄰矩陣ST-GCN於基於人體骨架之雙人互動動作辨識)
Advisors: Kai-Lung Hua (花凱龍), Chao-Lung Yang (楊朝龍)
Oral Defense Committee: Yu-Chi Lai (賴祐吉), Chao Ou-Yang (歐陽超), Chao-Lung Yang (楊朝龍), Kai-Lung Hua (花凱龍)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108 (2019-2020)
Language: English
Number of Pages: 53
Keywords: skeleton-based action recognition, spatial-temporal graph convolutional network, two-person interaction recognition, pairwise adjacency matrix

Spatial-temporal graph convolutional networks (ST-GCN) have achieved outstanding performance on human action recognition. However, ST-GCN is less effective for two-person interaction recognition (TPIR) because the connections between the skeletons of the two persons are not explicitly defined. In this study, we improve the ST-GCN model for TPIR by employing a pairwise adjacency matrix (PAM) to capture the relationship between person-person skeletons; the resulting model is referred to as ST-GCN-PAM. To validate the effectiveness of the proposed ST-GCN-PAM model on TPIR, experiments were conducted on the NTU RGB+D 120 dataset, and the model was further examined on the Kinetics, NTU RGB+D 60, UT-Interaction, and SBU-Kinetics datasets. The results show that the proposed ST-GCN-PAM outperforms state-of-the-art methods on the mutual actions of NTU RGB+D 120, achieving recognition accuracies of 83.28% (cross-subject) and 88.31% (cross-view). ST-GCN-PAM also outperforms the original ST-GCN on the multi-human actions of the Kinetics dataset, reaching 41.68% Top-1 and 88.91% Top-5 accuracy. Similarly, ST-GCN-PAM achieves superior performance on UT-Interaction, with 76.6% on Set-1 and 77.3% on Set-2, and reaches 94.6% on the SBU-Kinetics dataset.
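
The core idea summarized above is that the original ST-GCN graph links only joints within a single body, so a pairwise adjacency matrix is added to also link joints across the two interacting bodies. The following Python/NumPy sketch is a minimal illustration of that idea, not the thesis implementation: the intra-body edge list is the standard 25-joint NTU RGB+D skeleton, the cross-body rule (joint i of person 1 connected to joint i of person 2) is a hypothetical choice made here for illustration, and the normalization is the usual graph-convolution form D^-1/2 (A + I) D^-1/2.

# Minimal sketch (not the thesis implementation): building a two-person
# adjacency matrix with intra-body edges plus hypothetical pairwise
# cross-body edges, then applying one spatial graph-convolution step.
import numpy as np

V = 25  # joints per person (NTU RGB+D skeleton)

# Intra-body skeleton edges for one person (1-based NTU joint labels).
ntu_edges = [(1, 2), (2, 21), (3, 21), (4, 3), (5, 21), (6, 5), (7, 6),
             (8, 7), (9, 21), (10, 9), (11, 10), (12, 11), (13, 1),
             (14, 13), (15, 14), (16, 15), (17, 1), (18, 17), (19, 18),
             (20, 19), (22, 23), (23, 8), (24, 25), (25, 12)]

def two_person_adjacency(cross_pairs):
    """Block adjacency for 2*V joints: two intra-body blocks on the
    diagonal plus symmetric person-to-person edges (the pairwise part)."""
    A = np.zeros((2 * V, 2 * V))
    for i, j in ntu_edges:                      # person 1 and person 2
        for off in (0, V):
            A[i - 1 + off, j - 1 + off] = 1
            A[j - 1 + off, i - 1 + off] = 1
    for i, j in cross_pairs:                    # pairwise (cross-body) links
        A[i - 1, j - 1 + V] = 1
        A[j - 1 + V, i - 1] = 1
    return A

def normalize(A):
    """Symmetric normalization used in graph convolution: D^-1/2 (A+I) D^-1/2."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Hypothetical cross-body rule: connect each joint to the same joint on the
# other person (the thesis may define a different pairwise connectivity).
cross = [(i, i) for i in range(1, V + 1)]
A_hat = normalize(two_person_adjacency(cross))

# One spatial graph-convolution step on random joint features:
# X has shape (2V, C_in); W projects to C_out channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((2 * V, 3))            # e.g. 3D joint coordinates
W = rng.standard_normal((3, 64))
X_out = A_hat @ X @ W                          # shape (50, 64)
print(X_out.shape)

The full ST-GCN also convolves along the temporal dimension and partitions the spatial neighborhood into subsets with separate weights; the sketch covers only a single spatial step and one possible way to fill the pairwise block.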

ABSTRACT ii
CHINESE ABSTRACT (摘要) iii
TABLE OF CONTENTS iv
LIST OF FIGURES vi
LIST OF TABLES vii
CHAPTER I INTRODUCTION 1
CHAPTER II LITERATURE REVIEW 4
  2.1 Skeleton Based Action Recognition 4
  2.2 Graph Convolution Network 5
  2.3 Two-Person Interaction Recognition (TPIR) 6
  2.4 Pairwise Graph Connectivity 7
  2.5 Action Recognition Dataset 7
    2.5.1 NTU RGB+D 120 [6] and NTU RGB+D 60 [7] 8
    2.5.2 UT-Interaction 10
    2.5.3 SBU-Kinetics Dataset 11
    2.5.4 Kinetics Dataset 12
CHAPTER III PAIRWISE ADJACENCY MATRIX ON SPATIAL TEMPORAL GRAPH CONVOLUTION NETWORK 14
  3.1 Overall Methodology 14
  3.2 Spatial-Temporal Graph Convolution Network 15
    3.2.1 Graph Definition 16
    3.2.2 Spatial-Temporal Graph Convolution Network 17
    3.2.3 Skeleton Graph Construction 18
    3.2.4 Implementation of ST-GCN 20
  3.3 Pairwise Connectivity 21
  3.4 Pairwise Adjacency Matrix 22
  3.5 Network and Training Architecture 23
CHAPTER IV RESULT AND DISCUSSION 26
  4.1 Dataset Evaluation 26
  4.2 Experimental Settings 28
  4.3 Result and Discussion 28
CHAPTER V CONCLUSIONS AND FUTURE WORKS 39
  5.1 CONCLUSIONS 39
  5.2 FUTURE WORKS 39
REFERENCES 41

[1] K. Hara, H. Kataoka, and Y. Satoh, "Learning spatio-temporal features with 3D residual networks for action recognition," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 3154-3160.
[2] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in neural information processing systems, 2014, pp. 568-576.
[3] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Skeleton-Based Action Recognition with Directed Graph Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, 2019, pp. 7912-7921.
[4] M. Fu et al., "Human Action Recognition: A Survey," in Proceedings of the 5th International Conference on Signal and Information Processing, Networking and Computers (ICSINC), Yuzhou, China, 2018, pp. 69-77: Springer Singapore.
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, Hawaii, 2017, pp. 7291-7299.
[6] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L. Duan, and A. K. Chichung, "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2019.
[7] A. Shahroudy, J. Liu, T. Ng, and G. Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 1010-1019.
[8] M. J. Marín-Jiménez, E. Yeguas, and N. Pérez de la Blanca, "Exploring STIP-based models for recognizing human interactions in TV videos," Pattern Recognition Letters, vol. 34, no. 15, pp. 1819-1828, Nov. 2013.
[9] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3d skeletons as points in a lie group," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 588-595.
[10] T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," presented at the 5th International Conference on Learning Representations (ICLR-17), Toulon, France, April 24-26, 2017. Available: https://openreview.net/forum?id=SJU4ayYgl
[11] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018, pp. 7444-7452: AAAI Press.
[12] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019.
[13] F. Sener and N. Ikizler-Cinbis, "Two-person interaction recognition via spatial multiple instance embedding," Journal of Visual Communication and Image Representation, vol. 32, pp. 63-73, 2015.
[14] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, "Two-person interaction detection using body-pose features and multiple instance learning," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 28-35: IEEE.
[15] M. Perez, J. Liu, and A. C. Kot, "Interaction Relational Network for Mutual Action Recognition," arXiv preprint arXiv:1910.04963, 2019.
[16] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, "3-d human action recognition by shape analysis of motion trajectories on riemannian manifold," IEEE transactions on cybernetics, vol. 45, no. 7, pp. 1340-1352, 2014.
[17] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations," in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[18] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, "Modeling video evolution for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5378-5387.
[19] R. Zhao, K. Wang, H. Su, and Q. Ji, "Bayesian Graph Convolution LSTM for Skeleton Based Action Recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV 2019), Seoul, South Korea, 2019, pp. 6882-6892.
[20] J. Liu, A. Shahroudy, G. Wang, L. Duan, and A. K. Chichung, "Skeleton-Based Online Action Prediction Using Scale Selection Network," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2019.
[21] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, "Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition," presented at the CVPR 2019, Long Beach, CA, USA, June 16-20, 2019.
[22] T. Soo Kim and A. Reiter, "Interpretable 3d human action analysis with temporal convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 20-28.
[23] M. Liu, H. Liu, and C. Chen, "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognition, vol. 68, pp. 346-362, 2017.
[24] C. Caetano, J. Sena, F. Brémond, J. A. D. Santos, and W. R. Schwartz, "SkeleMotion: A New Representation of Skeleton Joint Sequences based on Motion Information for 3D Action Recognition," in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, Taiwan, 2019, pp. 1-8.
[25] C. Wu, X.-J. Wu, and J. Kittler, "Spatial Residual Layer and Dense Connection Block Enhanced Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[26] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, "Graph convolutional networks: a comprehensive review," Computational Social Networks, vol. 6, no. 1, p. 11, 2019.
[27] P. Battaglia, R. Pascanu, M. Lai, and D. J. Rezende, "Interaction networks for learning about objects, relations and physics," in Advances in neural information processing systems, 2016, pp. 4502-4510.
[28] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588-3597.
[29] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, "Neural relational inference for interacting systems," arXiv preprint arXiv:1802.04687, 2018.
[30] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[31] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in neural information processing systems, 2016, pp. 3844-3852.
[32] D. K. Duvenaud et al., "Convolutional networks on graphs for learning molecular fingerprints," in Advances in neural information processing systems, 2015, pp. 2224-2232.
[33] A. Manzi, L. Fiorini, R. Limosani, P. Dario, and F. Cavallo, "Two-person activity recognition using skeleton data," IET computer Vision, vol. 12, no. 1, pp. 27-35, 2017.
[34] S. Mehnaz and M. S. Rahman, "Pairwise compatibility graphs revisited," in 2013 International Conference on Informatics, Electronics and Vision (ICIEV), Dhaka, Bangladesh, 2013, pp. 1-6.
[35] X. He, M. Gao, M.-Y. Kan, and D. Wang, "Birank: Towards ranking on bipartite graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 57-71, 2016.
[36] M. Li and H. Leung, "Multi-view depth-based pairwise feature learning for person-person interaction recognition," Multimedia Tools and Applications, vol. 78, no. 5, pp. 5731-5749, 2019.
[37] B. Liu, H. Cai, X. Ji, and H. Liu, "Human-human interaction recognition based on spatial and motion trend feature," in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 4547-4551: IEEE.
[38] T. Huynh-The et al., "PAM-based flexible generative topic model for 3D interactive activity recognition," in 2015 International Conference on Advanced Technologies for Communications (ATC), 2015, pp. 117-122.
[39] M. S. Ryoo, C.-C. Chen, J. Aggarwal, and A. Roy-Chowdhury, "An overview of contest on semantic description of human activities (SDHA) 2010," in International Conference on Pattern Recognition, 2010, pp. 270-285: Springer.
[40] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450-6459.
[41] G. Bergami, M. Magnani, and D. Montesi, "A Join Operator for Property Graphs," in EDBT/ICDT Workshops, 2017.
[42] D. M. Cardoso, M. A. A. de Freitas, E. A. Martins, and M. Robbiano, "Spectra of graphs obtained by a generalization of the join graph operation," Discrete Mathematics, vol. 313, no. 5, pp. 733-741, 2013.
[43] B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting," arXiv preprint arXiv:1709.04875, 2017.
[44] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[45] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 3007-3021, 2018.
[46] J. Liu, G. Wang, P. Hu, L. Duan, and A. C. Kot, "Global Context-Aware Attention LSTM Networks for 3D Action Recognition," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 3671-3680: IEEE.
[47] T. M. Le, N. Inoue, and K. Shinoda, "A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition," arXiv preprint arXiv:1805.11790, 2018. Available: https://ui.adsabs.harvard.edu/abs/2018arXiv180511790L
[48] Y. Qin, L. Mo, C. Li, and J. Luo, "Skeleton-based action recognition by part-aware graph convolutional networks," The visual computer, vol. 36, no. 3, pp. 621-631, 2020.
[49] Y. Ji, G. Ye, and H. Cheng, "Interactive body part contrast mining for human interaction recognition," in 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2014, pp. 1-6: IEEE.
[50] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110-1118.
[51] W. Zhu et al., "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[52] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal lstm with trust gates for 3d human action recognition," in European conference on computer vision, 2016, pp. 816-833: Springer.
[53] K. Nour el houda Slimani, Y. Benezeth, and F. Souami, "Human interaction recognition based on the co-occurence of visual words," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 455-460.
[54] M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in 2009 IEEE 12th international conference on computer vision, 2009, pp. 1593-1600: IEEE.
[55] Q. Ke, M. Bennamoun, S. An, F. Boussaid, and F. Sohel, "Human interaction prediction using deep temporal features," in European Conference on Computer Vision, 2016, pp. 403-414: Springer.
[56] T. Lan, T.-C. Chen, and S. Savarese, "A hierarchical representation for future action prediction," in European Conference on Computer Vision, 2014, pp. 689-704: Springer.
