
Author: 王思豪 (Si-Hao Wang)
Thesis title: Applying Spatial Temporal Graph Convolutional Networks to Online Human Action Recognition and Transition Prediction
Advisor: 楊朝龍 (Chao-Lung Yang)
Committee members: 花凱龍 (Kai-Lung Hua), 許嘉裕 (Chia-Yu Hsu)
Degree: Master
Department: Department of Industrial Management, College of Management
Year of publication: 2023
Graduation academic year: 111 (ROC calendar)
Language: English
Number of pages: 62
Chinese keywords: real-time human action recognition, graph convolutional networks, action transitions
English keywords: Online HAR, GCN, Action transitions
In the wave of Industry 4.0 development, equipping collaborative robots with stronger visual perception is an important issue. This includes recognizing in real time the action an operator is currently performing and knowing in advance when the operator is about to move on to the next action. However, the transition between two actions is very short, and an ambiguous interval appears during the switch in which the action is often difficult to recognize accurately. This study uses spatial-temporal graph convolutional networks to develop a machine learning framework that, under simulated real-time conditions, recognizes the operator's actions while simultaneously predicting whether the action is about to change. The study first detects human joint keypoints with an action representation model (the Mediapipe framework), then segments the data with a sliding window algorithm and assigns each action segment a multi-label defined as a probability distribution. The STGCN is trained on the multi-label data to learn online action recognition, and its output is fed to an eXtreme Gradient Boosting (XGBoost) model to recognize the action transition intervals. To match the assembly-task scenario targeted by this research, 58 complete work videos of 8 participants assembling motherboards were collected as the dataset, which was then used to compare the proposed method with other methods. The experimental results show that the proposed method achieves a highly reliable accuracy of 98.23% with the fastest execution time. For the action transition intervals, the proposed method achieves an accuracy of 92.64%, outperforming the original validation approach.
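As a rough illustration of the sliding-window segmentation and probability-based multi-label definition described above, the short Python sketch below cuts a per-frame keypoint sequence into overlapping windows and labels each window by the class proportions of its frames. It assumes frame-level action labels are available; the function and parameter names (sliding_windows, window_size, stride) are illustrative choices, not the thesis implementation.

import numpy as np

def sliding_windows(keypoints, frame_labels, num_classes, window_size=30, stride=1):
    # keypoints:    (T, J, C) array of per-frame joint coordinates from a pose estimator
    # frame_labels: (T,) array of per-frame action class indices
    # Returns overlapping windows and one soft (probability-style) label per window.
    windows, soft_labels = [], []
    for start in range(0, len(keypoints) - window_size + 1, stride):
        end = start + window_size
        windows.append(keypoints[start:end])
        # Fraction of frames per class inside the window, so a window spanning a
        # transition carries weight on both the outgoing and the incoming action.
        counts = np.bincount(frame_labels[start:end], minlength=num_classes)
        soft_labels.append(counts / counts.sum())
    return np.stack(windows), np.stack(soft_labels)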


In the wave of Industry 4.0 development, equipping collaborative robots with enhanced visual perception is a critical issue. This includes real-time recognition of operator actions and anticipation of their next actions. However, the transition between two actions is often short and blurry, making it challenging to identify the transitional period accurately. In this study, we developed a machine learning framework based on spatial-temporal graph convolutional networks that recognizes operator actions under simulated real-time conditions and simultaneously predicts upcoming action transitions. First, we used an action representation model (the Mediapipe framework) to detect human joint keypoints, and then applied a sliding window algorithm to segment the data. A probability distribution was assigned to each segmented window to define its multi-label. The STGCN model was trained on this labeled data to perform online action recognition, and its output was used as input to an eXtreme Gradient Boosting (XGBoost) model that recognizes the transitional periods between actions. To meet the research requirements in the context of assembly tasks, we collected a dataset of 58 complete videos of 8 participants assembling motherboards and used it to compare the proposed method with other action recognition approaches. The experimental results demonstrated the reliability of the proposed method, which achieved an accuracy of 98.23% with the fastest execution time. For the transitional periods, the method achieved an accuracy of 92.64%, outperforming the baseline approach.
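The coupling between the two models described above can be sketched in a few lines: the per-window class probabilities produced by the trained STGCN become the feature vectors of an XGBoost classifier that flags windows falling in a transition period. This is a minimal sketch under assumed inputs; the file names, train/test split, and hyperparameters are placeholders rather than the settings used in this study.

import numpy as np
import xgboost as xgb

# Hypothetical inputs: stgcn_probs holds the class-probability vector that the
# trained STGCN emits for each sliding window (N x num_classes), and is_transition
# marks whether that window lies inside a labeled transition period (0 or 1).
stgcn_probs = np.load("stgcn_window_probs.npy")    # placeholder file name
is_transition = np.load("transition_labels.npy")   # placeholder file name

# Train a gradient-boosted tree classifier (XGBoost) on the STGCN outputs.
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(stgcn_probs[:-500], is_transition[:-500])  # illustrative train split

# Online use: score each newly arriving window's probability vector.
preds = clf.predict(stgcn_probs[-500:])
print("windows flagged as transitions:", int(preds.sum()))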

Abstract (Chinese)
ABSTRACT
Acknowledgements
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. INTRODUCTION
  1.1. The Status of the Manufacturing Industry
  1.2. The Status of Applying Human Action Recognition to the Manufacturing Industry
  1.3. Thesis Structure
CHAPTER 2. LITERATURE REVIEW
  2.1. Human Action Recognition
  2.2. Action Representation
  2.3. Action Classification
  2.4. Online Recognition
CHAPTER 3. METHODOLOGY
  3.1. Research Framework
  3.2. Action Representation Model
  3.3. Data Pre-Processing
  3.4. STGCN
  3.5. Probability Filter
  3.6. Action Transitions Prediction
CHAPTER 4. EXPERIMENTS AND RESULTS
  4.1. Data and Label
    4.1.1. Data Acquisition
    4.1.2. Data Labeling
  4.2. Implementation
    4.2.1. Configuration
    4.2.2. Performance Evaluation
  4.3. Experiments and Results
    4.3.1. Experiments of HAR Prediction
    4.3.2. Experiments of Action Transfer Prediction
  4.4. Result Discussion
CHAPTER 5. CONCLUSION
  5.1. Conclusion
  5.2. Future Work
REFERENCES


Full text available from 2025/07/17 (off-campus network)
Full text available from 2025/07/17 (National Central Library: Taiwan Dissertations and Theses System)