
Author: Melkamu Sewuyie Denekew
Thesis Title: Pose Spatio-Temporal based Human Action Recognition
Advisor: Kai-Lung Hua (花凱龍)
Committee: Chao-Lung Yang (楊朝龍), Yi-Ling Chen (陳怡伶), Kai-Lung Hua (花凱龍)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 35
Keywords: Action Recognition, Feature Descriptor, Fisher Vector, Pose Representation

Recognizing human actions in video sequences remains a challenging problem. Several action representations have been proposed to improve recognition performance, but many problems remain unsolved. For example, the skeleton-sequence representations produced by most previous methods lack spatial information about the joints and detailed temporal information about their motion. To extract human motion information efficiently and improve the accuracy of human action recognition from video, we propose a pose spatio-temporal approach to human action recognition that uses joint point information instead of structural information. First, we acquire the joint positions of the human body in every frame of the video. Then, we extract pose information using handcrafted features based on the positions of the joints in the spatial dimension. We also compute the change along the temporal dimension. These two sets of features form our human pose spatio-temporal feature descriptors. We then compute a fixed-dimension Fisher vector for each descriptor separately. Finally, we use a weighted fusion technique to classify the action. We evaluate the approach on two public datasets: the proposed algorithm achieves 97.8% accuracy on the Penn Action dataset and 77.7% accuracy on the JHMDB dataset, improving action recognition accuracy over previous methods.
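The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not specify the handcrafted features, so this sketch assumes pairwise joint offsets as the spatial feature and frame-to-frame joint displacement as the temporal feature. The Fisher vector encoding and the classifier are omitted, and the function names (`spatial_descriptor`, `temporal_descriptor`, `weighted_fusion`) and the `weight` parameter are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# One (x, y) position per joint; a video is a sequence of such frames.
Frame = Sequence[Tuple[float, float]]


def spatial_descriptor(frame: Frame) -> List[float]:
    """Assumed spatial feature: (dx, dy) offset between every pair of joints."""
    features: List[float] = []
    for (xa, ya), (xb, yb) in combinations(frame, 2):
        features.extend([xa - xb, ya - yb])
    return features


def temporal_descriptor(prev: Frame, curr: Frame) -> List[float]:
    """Assumed temporal feature: displacement of each joint between frames."""
    features: List[float] = []
    for (x0, y0), (x1, y1) in zip(prev, curr):
        features.extend([x1 - x0, y1 - y0])
    return features


def weighted_fusion(
    spatial_scores: Sequence[float],
    temporal_scores: Sequence[float],
    weight: float = 0.5,
) -> Tuple[int, List[float]]:
    """Late fusion: convex combination of per-class scores from two streams.

    Returns the index of the best class and the fused scores.
    """
    fused = [
        weight * s + (1.0 - weight) * t
        for s, t in zip(spatial_scores, temporal_scores)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Usage with three joints in two consecutive frames.
frame0 = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
frame1 = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
spatial = spatial_descriptor(frame0)
temporal = temporal_descriptor(frame0, frame1)
label, fused = weighted_fusion([0.25, 0.75], [1.0, 0.0], weight=0.5)
```

In the approach the abstract describes, each descriptor stream would be encoded as a Fisher vector and scored by a classifier before fusion; the score lists passed to `weighted_fusion` here are made-up placeholders for those classifier outputs.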



Table of Contents
Abstract
List of Tables
List of Figures
Chapter One
1. Introduction
  1.1. Background
  1.1. Related Work
  1.2. Contribution
  1.3. Approach
Chapter Two
2. Pose Spatio-Temporal based Human Action Recognition
  2.1. Pose Estimation
  2.2. Extract Time and Space Features
    2.2.1. Normalization Human Body Coordinates
    2.2.2. Extract Spatial Temporal Features
  2.3. Feature Coding
  2.4. Action Recognition
    2.4.1. Weighted Fusion
    2.4.2. Feature Classification
Chapter Three
3. Experiments and Discussion
  3.1. Datasets
  3.2. Experiment Result
    3.2.1. JHMDB Activity Dataset Experimental Result
    3.2.2. Evaluate Different Features
    3.2.3. Feature Code Representations
    3.2.4. Evaluate Different Limb Parts of Human Body
    3.2.5. Evaluate and Compare with other Methods
  3.3. Penn Action Dataset Experimental Results
Chapter Four
4. Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
Reference

