
Graduate Student: Richard
Thesis Title: 基於動作辨識標準清潔程序符合性確認之研究
(Conformance Checking on Self-cleaning Procedures based on Activity Recognition)
Advisors: Shuo-Yan Chou (周碩彥); Po-Hsun Kuo (郭伯勳)
Committee: Vincent F. Yu (喻奉天)
Degree: Master
Department: Department of Industrial Management, College of Management
Publication Year: 2020
Academic Year of Graduation: 108
Language: English
Pages: 46
Keywords: Convolutional Neural Network, Dynamic Sliding Window, Human Skeleton Key Points, Human Activity Recognition, K-Means, LSTM

In this study, a novel framework for human activity recognition is proposed, using human skeleton key points and a dynamic sliding window for real-time prediction. The framework starts with a data processing phase, in which the human skeleton key points and the human posture are converted into motion features for each frame. In particular, a human activity can be represented by the sequence of motion features extracted from all frames of a video. A learning strategy is then proposed that groups the motion features via K-Means and classifies the human activity via a Long Short-Term Memory (LSTM) network.
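The following is a minimal sketch of this pipeline, assuming 25 OpenPose-style (x, y) key points per frame, frame-to-frame joint displacements as the motion features, and a single-layer Keras LSTM classifier; the cluster count, sequence length, and all names are illustrative assumptions, not the thesis's exact implementation.

    # Sketch: per-frame motion features -> K-Means posture symbols -> LSTM.
    # Assumptions (not from the thesis): 25 key points, displacement
    # features, synthetic data, and illustrative hyperparameters.
    import numpy as np
    from sklearn.cluster import KMeans
    from tensorflow.keras import layers, models

    N_JOINTS, N_CLUSTERS, N_ACTIVITIES, SEQ_LEN = 25, 32, 5, 60

    def motion_features(frames):
        """frames: (T, N_JOINTS, 2) array of (x, y) key points.
        Returns (T-1, N_JOINTS*2) frame-to-frame joint displacements."""
        return np.diff(frames, axis=0).reshape(len(frames) - 1, -1)

    # Synthetic stand-in for skeleton sequences extracted from videos.
    rng = np.random.default_rng(0)
    videos = [rng.normal(size=(SEQ_LEN + 1, N_JOINTS, 2)) for _ in range(100)]
    labels = rng.integers(0, N_ACTIVITIES, size=100)

    feats = [motion_features(v) for v in videos]  # each (SEQ_LEN, 50)
    kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(np.vstack(feats))

    # Each frame's feature vector is replaced by its cluster id (one-hot),
    # so an activity becomes a sequence of posture-cluster symbols.
    def to_symbols(f):
        return np.eye(N_CLUSTERS)[kmeans.predict(f)]

    X = np.stack([to_symbols(f) for f in feats])  # (100, SEQ_LEN, N_CLUSTERS)

    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, N_CLUSTERS)),
        layers.LSTM(64),
        layers.Dense(N_ACTIVITIES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(X, labels, epochs=3, verbose=0)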
For real-time human activity recognition, most methods still rely on a static sliding window to extract information. However, a static window leads to misclassification when the duration of an activity at inference time differs from the durations seen in the training data; in other words, there is a high possibility that the system extracts only partial information about an activity. To address this issue, a Convolutional Neural Network (CNN) is proposed in this research to obtain a dynamic sliding window size based on the similarity of the human posture across frames.
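The thesis derives the window size with a CNN (in a YOLO-based configuration, per the table of contents); the simplified stand-in below illustrates only the underlying idea, growing a window over the posture stream until consecutive posture vectors stop being similar under a cosine-similarity threshold, which marks a likely activity boundary. The threshold and length bounds are illustrative assumptions.

    # Simplified stand-in for the dynamic sliding window (the thesis uses
    # a CNN for this step). Windows grow until posture similarity drops.
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def dynamic_windows(postures, threshold=0.9, min_len=10, max_len=120):
        """postures: (T, D) per-frame posture vectors from the skeleton.
        Yields (start, end) windows whose size adapts to posture changes."""
        start = 0
        for t in range(1, len(postures)):
            boundary = cosine(postures[t - 1], postures[t]) < threshold
            if (boundary and t - start >= min_len) or t - start >= max_len:
                yield start, t
                start = t
        if len(postures) - start >= min_len:
            yield start, len(postures)

    stream = np.random.default_rng(1).normal(size=(300, 50))
    for s, e in dynamic_windows(stream):
        pass  # each (s, e) window would be symbolized and fed to the LSTM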
Comprehensive experiments are conducted on a human activity recognition dataset for a Self-Cleaning Standard Operation Procedure (SC-SOP). The results show that the proposed framework achieves promising performance on streaming data.


Table of Contents
Abstract i
Acknowledgement ii
Table of Contents iii
List of Figures iv
List of Tables v
Chapter 1 Introduction 1
  1.1 Background and Motivation 1
  1.2 Objective and Limitation 3
  1.3 The Method of Research 3
  1.4 Organization of the Thesis 4
Chapter 2 Literature Review 5
  2.1 Related Work for Human Activity Recognition 5
  2.2 OpenPose Key Points 6
  2.3 K-Means Clustering Algorithm 9
  2.4 Long Short-Term Memory (LSTM) 10
  2.5 CNN based on YOLO Configuration 11
Chapter 3 The Methodology 13
  3.1 Data Collection 13
  3.2 Data Processing 14
    3.2.1 Skeleton Feature Extraction 14
    3.2.2 Multi-modal Features Fusion 17
    3.2.3 Posture Selection 18
  3.3 Data Modeling 18
  3.4 System Architecture for Non-Streaming Activity Recognition 20
  3.5 System Architecture for Real-time Activity Recognition 22
Chapter 4 Experiments and Discussion 24
  4.1 Dataset Description 24
  4.2 Hardware Specification 25
  4.3 Experimental Results 26
Chapter 5 Conclusion and Future Research 32
  5.1 Conclusion 32
  5.2 Future Research 32
References 34


Full text release date: 2025/02/03 (campus network)
Full text not authorized for public release (off-campus network)
Full text not authorized for public release (National Central Library: Taiwan thesis system)