
Graduate Student: Sheng-Fang Chen (陳聖方)
Thesis Title: LSTM with Hand-crafted View-Invariant and Differential Cues (HVDC) for 3D Human Action Recognition (利用視角無關以及差分線索解決三維人體動作識別之 LSTM 深度學習技術)
Advisor: Sheng-Luen Chung (鍾聖倫)
Oral Defense Committee: Shun-Feng Su (蘇順豐), Chung-Hsien Kuo (郭重顯), Sheng-Luen Chung (鍾聖倫), Gee-Sern Hsu (徐繼聖), Ching-Hu Lu (陸敬互)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: Chinese
Number of Pages: 71
Chinese Keywords: hand-crafted cue, action recognition, long short-term memory model, human skeleton
English Keywords: hand-crafted cue, action recognition, LSTM, skeleton joints
Action recognition has a wide range of applications, but it is challenging because correct recognition hinges on two kinds of key information: first, the spatial cue, which describes the configuration of the body's limbs; second, the temporal cue, which describes the movement trajectories of those limbs over time. Recent deep learning techniques have greatly improved the accuracy of image recognition, but their accuracy on action recognition still leaves room for improvement, mainly because the datasets available for action recognition are far fewer and smaller than the abundant, diverse datasets available for general image recognition. Accordingly, based on streams of human skeleton data captured by depth cameras in response to the performed actions, this thesis proposes an action recognition method built on an LSTM deep learning architecture guided by hand-crafted cues. More specifically, for datasets of limited size, instead of using raw skeleton data, the thesis adopts hand-crafted cues for the two attributes most critical to action recognition: the Skeleton View-Invariant Transformation (SVIT) cue for the spatial attribute, and the differential (Diff) cue, equivalent to the motion trajectory, for the temporal attribute. These two cues are then combined on the LSTM architecture through different fusion techniques, and trained and tested on NTU RGB+D, currently the largest dataset in the literature, and on MSR DailyActivity, the smallest. The proposed LSTM recognition method, taking the fused view-invariant and differential features as input, outperforms the results reported in the classic literature on the same datasets.
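As a rough illustration of the Diff cue described above, the short Python sketch below computes per-joint displacements over down-sampled skeleton frames. The array shape, the down-sampling stride, and the function name diff_cue are illustrative assumptions, not the exact formulation used in the thesis.

    import numpy as np

    def diff_cue(skeleton, step=2):
        # skeleton: array of shape (T, 25, 3), one sequence of 25 joints in 3D coordinates.
        # step: down-sampling stride (assumed here; not fixed by the abstract).
        sampled = skeleton[::step]              # down-sample the raw frames
        disp = sampled[1:] - sampled[:-1]       # frame-to-frame joint displacements
        return disp.reshape(disp.shape[0], -1)  # flatten to (T'-1, 75) per-frame vectors

    # Example: a random 120-frame sequence stands in for a real NTU RGB+D skeleton stream.
    seq = np.random.randn(120, 25, 3)
    print(diff_cue(seq, step=2).shape)          # (59, 75)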


Good action recognition relies on the correct interpretation of two critical attributes of an action: the spatial attribute, describing the detected person's posture, and the temporal attribute, describing the detected person's body movement. Whereas deep learning has greatly improved image recognition, comparable progress has not been seen in action recognition. One of the main reasons is the complexity introduced by the additional temporal dimension; another is that far fewer annotated training samples are available for action recognition than for image recognition. In this regard, this thesis proposes a hand-crafted-cue LSTM model for human action recognition based on RGB-D data, represented as sequences of 25 skeleton joints in 3D coordinates, as provided by NTU RGB+D, currently the most comprehensive dataset for action recognition. Instead of the raw skeleton joints, hand-crafted cues, pre-processed representations designed to facilitate focused learning, are proposed as input to the LSTM structure. In particular, for the spatial cue, the SVIT cue derived by the Skeleton View-Invariant Transformation is adopted; for the temporal cue, the Diff cue, computed by taking the displacements of all joints across down-sampled raw frames, is utilized. Under the standard train/test protocol, the experiments conducted on NTU RGB+D show that recognition based on either of the proposed hand-crafted cues outperforms recognition based on the raw data. In addition, with the proposed feature-fusion and/or decision-fusion techniques combining the two hand-crafted cues, the recognition performance surpasses that of the state-of-the-art approaches evaluated on the same dataset under the same train/test protocol.
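To make the feature-fusion idea concrete, the following sketch concatenates the SVIT and Diff cues at each time step and feeds the fused sequence to a stacked LSTM classifier. The sequence length, feature dimensions, layer sizes, and 60-class output (the number of NTU RGB+D action classes) are assumptions for illustration, the two cues are assumed to be pre-aligned to the same length, and Keras is used only as a convenient vehicle; the thesis's actual network configuration may differ.

    from tensorflow.keras import layers, Model

    # Assumed shapes: 100 time steps, 75-dimensional SVIT and Diff frames per step.
    svit_in = layers.Input(shape=(100, 75), name="svit_cue")
    diff_in = layers.Input(shape=(100, 75), name="diff_cue")

    # Feature fusion: concatenate the two hand-crafted cues at each time step,
    # then let a stacked LSTM model the fused sequence.
    fused = layers.Concatenate(axis=-1)([svit_in, diff_in])
    x = layers.LSTM(128, return_sequences=True)(fused)
    x = layers.LSTM(128)(x)
    out = layers.Dense(60, activation="softmax")(x)  # 60 NTU RGB+D action classes

    model = Model([svit_in, diff_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.summary()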

Abstract (Chinese)
Abstract (English)
Acknowledgments
List of Figures
List of Tables
Chapter I. Introduction
    1.1 Background
    1.2 Action Recognition by Deep Learning
    1.3 Technical Challenges of Action Recognition
    1.4 Contribution
    1.5 Paper Organization
Chapter II. Literature Survey
    2.1 CNN Approach on RGB Data for Action Recognition
    2.2 LSTM Approach on RGB-D Data
    2.3 Hand-crafted Cues
    2.4 Fusion to Boost Performance
Chapter III. Hand-crafted Skeleton Cues and Fusions on LSTM
    3.1 Long Short-Term Memory (LSTM) in Summary
    3.2 Skeleton View-Invariant Transform (SVIT) for Spatial Cue
    3.3 Diff for Temporal Cue
    3.4 Feature Fusion on SVIT and Diff
    3.5 Decision Fusion with Additional Streams
Chapter IV. Experimental Results
    4.1 NTU Dataset, Evaluation Protocol and the Platform
    4.2 SVIT and Diff Cues
    4.3 Fused SVIT and Diff Cues
    4.4 Decision Fusion
    4.5 Comparison with the State of the Art
    4.6 Test on the Additional MSR DailyActivity Dataset
Chapter V. Conclusion
    5.1 Conclusion
    5.2 Future Work
Appendix
References
