Author: |
Melkamu Sewuyie Denekew |
Thesis Title: |
Pose Spatio-Temporal based Human Action Recognition |
Advisor: |
Kai-Lung Hua |
Committee: |
Chao-Lung Yang, Yi-Ling Chen, Kai-Lung Hua |
Degree: |
Master |
Department: |
College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Thesis Publication Year: | 2019 |
Graduation Academic Year: | 107 |
Language: | English |
Pages: | 35 |
Keywords (in Chinese): | Action Recognition, Feature Descriptor, Fisher Vector, Pose Representation |
Keywords (in other languages): | Action Recognition, Feature Descriptor, Fisher Vector, Pose Representation |
Recognizing human actions in video sequences has been a challenging problem in recent years. Several action representation approaches have been proposed to improve recognition performance, but many problems remain unsolved. For example, the skeleton-sequence representations produced by most previous methods lack spatial joint information and detailed temporal motion information. To extract human motion information efficiently and improve the accuracy of human action recognition from video, we propose a pose spatio-temporal approach that uses joint point information instead of structural information. First, we acquire the joint positions of the human body in every frame of the video. Then, we extract pose information using handcrafted features based on the positions of the joints in the spatial dimension. We also compute the change in the temporal dimension. These two sets of features form our human pose spatio-temporal feature descriptors. We then compute a fixed-dimension Fisher vector for each descriptor separately. Finally, we use a weighted fusion technique to classify the action. We evaluate on two public datasets and show that the proposed algorithm achieves 97.8% accuracy on the PennAction dataset and 77.7% accuracy on the JHMDB dataset, improving action recognition accuracy compared to previous methods.
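The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the abstract does not specify the handcrafted features, the mixture model, or the fusion weights, so pairwise joint distances stand in for the spatial features, frame-to-frame differences stand in for the temporal features, the diagonal GMM parameters are assumed to have been fitted beforehand, and all function names (`spatial_features`, `temporal_features`, `fisher_vector`, `weighted_fusion`) and the weight `alpha` are hypothetical.

```python
import numpy as np


def spatial_features(joints):
    """Pairwise joint distances per frame (assumed stand-in feature).

    joints: array of shape (T, J, 2) with 2D joint positions per frame.
    Returns an array of shape (T, J * (J - 1) / 2).
    """
    diff = joints[:, :, None, :] - joints[:, None, :, :]  # (T, J, J, 2)
    dist = np.linalg.norm(diff, axis=-1)                   # (T, J, J)
    rows, cols = np.triu_indices(joints.shape[1], k=1)     # each pair once
    return dist[:, rows, cols]


def temporal_features(spatial):
    """Frame-to-frame change of the spatial features, shape (T - 1, D)."""
    return np.diff(spatial, axis=0)


def fisher_vector(x, weights, means, sigmas):
    """Fixed-length Fisher vector of descriptors x under a diagonal GMM.

    x: (T, D) descriptors; weights: (K,); means, sigmas: (K, D),
    where sigmas holds standard deviations. Returns a vector of length 2*K*D
    regardless of the number of frames T.
    """
    t = x.shape[0]
    z = (x[:, None, :] - means[None]) / sigmas[None]       # (T, K, D)
    # Log posterior per component, up to a constant shared by all components.
    log_p = (np.log(weights)[None]
             - 0.5 * np.sum(z ** 2, axis=2)
             - np.sum(np.log(sigmas), axis=1)[None])       # (T, K)
    log_p -= log_p.max(axis=1, keepdims=True)              # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients with respect to the means and standard deviations.
    g_mu = np.einsum("tk,tkd->kd", gamma, z)
    g_mu /= t * np.sqrt(weights)[:, None]
    g_sigma = np.einsum("tk,tkd->kd", gamma, z ** 2 - 1.0)
    g_sigma /= t * np.sqrt(2.0 * weights)[:, None]
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])


def weighted_fusion(scores_spatial, scores_temporal, alpha=0.5):
    """Fuse per-class scores of the two descriptors; alpha is an assumed weight."""
    fused = alpha * scores_spatial + (1.0 - alpha) * scores_temporal
    return int(np.argmax(fused))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    joints = rng.normal(size=(20, 13, 2))                  # synthetic video
    spatial = spatial_features(joints)
    temporal = temporal_features(spatial)
    k, d = 4, spatial.shape[1]
    # Placeholder GMM parameters; in practice these would be fitted on training data.
    weights = np.full(k, 1.0 / k)
    means = rng.normal(size=(k, d))
    sigmas = np.ones((k, d))
    print(fisher_vector(spatial, weights, means, sigmas).shape)
    print(fisher_vector(temporal, weights, means, sigmas).shape)
```

Because the Fisher vector length depends only on the number of mixture components and the descriptor dimension, videos of different lengths map to vectors of the same size, which is what allows a standard classifier to be trained on each descriptor before the scores are fused.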