
Graduate Student: Yen-Ping Lin (林晏平)
Thesis Title: Hand Assembly Action Prediction using Therblig Recognition by Spatial Temporal Graph Convolutional Network (利用時空圖卷積網路的動素識別進行手部組裝動作預測研究)
Advisor: Chao-Lung Yang (楊朝龍)
Committee Members: Kai-Lung Hua (花凱龍), Kung-Jeng Wang (王孔政)
Degree: Master
Department: Department of Industrial Management, School of Management (管理學院 - 工業管理系)
Year of Publication: 2021
Graduation Academic Year: 109 (ROC calendar)
Language: English
Number of Pages: 71
Keywords (Chinese): 骨架動作辨識、時空圖卷積網路、組裝生產線、動素、貼標
Keywords (English): skeleton-based hand gesture recognition, Spatial Temporal Graph Convolutional Networks, assembly line, Therblig, labeling
    This research aims to develop an assembly action analysis framework for assembly production sites based on skeleton-based hand gesture recognition. In the motion study proposed by Gilbreth, every hand action in the workplace is composed of 17 therbligs. Building on the concept of therblig analysis, this research investigates two topics: 1) whether, given the same video data, different labeling schemes and numbers of joints affect the prediction accuracy of the human action recognition model; and 2) a method for inferring assembly actions from ergonomics-based therblig recognition. The skeleton information extracted by the OpenPose toolkit is fed into a Spatial Temporal Graph Convolutional Network (ST-GCN) to recognize the continuous assembly actions or therbligs of the assembly operator. First, the prediction accuracy of the human action recognition model is analyzed under different labeling schemes and numbers of joints; then the sequences of therblig recognition results are matched by Dynamic Time Warping (DTW) to predict the most likely assembly action. Actions commonly performed in motherboard assembly were used for model training and testing. The first experiment shows that action labeling combined with 26 joints gives the recognition model its best accuracy, while therblig labeling, following the therblig definitions, needs only the 23 joints of the dominant hand to reach its best accuracy. The second experiment shows that predicting actions from therblig predictions alone reaches 73.75% accuracy, and that combining therbligs with the corresponding objects (Therblig-Item) raises the accuracy to 81.25%, indicating that therbligs paired with their corresponding objects predict actions more effectively. The results also show that therblig recognition can reduce the cost of retraining the model when assembly actions are changed.
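    To make the joint-subset comparison in the first topic concrete, the sketch below shows one way the same OpenPose skeleton sequence could be sliced into different keypoint subsets before being passed to a skeleton-based recognizer such as ST-GCN. It is a minimal illustration rather than code from the thesis, and the index lists ACTION_JOINTS and THERBLIG_JOINTS are hypothetical placeholders, since the abstract does not enumerate which 26 or 23 keypoints were used.

```python
import numpy as np

# Hypothetical joint-index lists: the abstract reports that action labels work best
# with 26 joints and therblig labels with the 23 joints of the dominant hand, but it
# does not enumerate them, so these index ranges are placeholders only.
ACTION_JOINTS = list(range(26))        # e.g., an upper-body plus both-hands subset
THERBLIG_JOINTS = list(range(25, 48))  # e.g., a dominant-hand-centered subset

def select_joints(skeleton_seq, joint_ids):
    """Slice an OpenPose sequence of shape (frames, joints, 3) down to the chosen joints.

    Each keypoint row is (x, y, confidence); a downstream skeleton model such as
    ST-GCN would consume the reduced tensor together with a matching joint graph.
    """
    return skeleton_seq[:, joint_ids, :]

# Example: 120 frames, 67 keypoints (25 body + 21 per hand in OpenPose output).
frames = np.random.rand(120, 67, 3)
print(select_joints(frames, ACTION_JOINTS).shape)    # (120, 26, 3)
print(select_joints(frames, THERBLIG_JOINTS).shape)  # (120, 23, 3)
```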


    This research aims to develop a hand gesture recognition framework for an assembly production site. In motion study, Gilbreth considered that all hand actions in the workplace are composed of 17 therbligs. This research utilizes the concept of therblig analysis to investigate two topics: 1) analyzing whether different labeling schemes and numbers of joints affect the prediction accuracy of human action recognition models trained on the same video data, and 2) proposing a method for inferring assembly actions from the results of therblig recognition. In this study, the skeleton information output by OpenPose was used to recognize the human actions of the assembly operation with a Spatial Temporal Graph Convolutional Network (ST-GCN). First, the prediction accuracy of the human action recognition model was analyzed under different labeling schemes and numbers of joints. Then, the possible assembly actions were classified by Dynamic Time Warping (DTW) applied to the combination of therblig recognition results. Actions commonly performed in motherboard assembly were used for model training and testing. In the first part of the experiment, action labeling with 26 joints produced the best recognition accuracy, while therblig labeling, in line with the therblig definitions, required only the 23 joints of the dominant hand to reach its best accuracy. The second part of the experiment showed that the accuracy of action classification based on therblig predictions alone was 73.75%, and that combining therbligs with the corresponding objects (Therblig-Item) increased the accuracy to 81.25%. These results indicate that the proposed therblig-based approach can reduce the cost of retraining the action recognition model when assembly actions change.
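    As a rough sketch of the second topic, the snippet below matches a recognized therblig sequence against per-action template sequences using a plain DTW distance and picks the closest action. The template dictionary, therblig names, and 0/1 substitution cost are illustrative assumptions rather than the templates or costs used in the thesis; a Therblig-Item variant could be approximated by appending the handled object to each symbol (e.g., "grasp:screw") before matching.

```python
from typing import Sequence

def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic dynamic-time-warping distance between two label sequences,
    using a 0/1 cost for matching/mismatching symbols."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match / substitution
    return d[n][m]

# Hypothetical action templates expressed as therblig sequences (placeholders).
TEMPLATES = {
    "pick_and_place_part": ["reach", "grasp", "move", "position", "release"],
    "fasten_screw":        ["reach", "grasp", "move", "position", "use", "release"],
}

def classify_action(predicted_therbligs: Sequence[str]) -> str:
    """Assign the action whose template has the smallest DTW distance."""
    return min(TEMPLATES, key=lambda k: dtw_distance(predicted_therbligs, TEMPLATES[k]))

# A noisy therblig prediction (repeated "grasp") still maps to the nearest template.
print(classify_action(["reach", "grasp", "grasp", "move", "position", "release"]))
```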

    摘要 (Chinese Abstract) i
    ABSTRACT ii
    致謝 (Acknowledgements) iii
    TABLE OF CONTENTS iv
    LIST OF FIGURES vi
    LIST OF TABLES viii
    CHAPTER 1. INTRODUCTION 1
    1.1 The Status of Manufacturing Industry 1
    1.2 Application Difficulties of Human Action Recognition in Manufacturing Industry 2
    1.3 Thesis Structure 3
    CHAPTER 2. LITERATURE REVIEW 4
    2.1 Human Action Recognition 4
    2.2 Skeleton-based Human Action Recognition 8
    2.3 Training Problem of Deep Learning Model 9
    CHAPTER 3. METHODOLOGY 11
    3.1 Research Framework 11
    3.2 OpenPose Skeleton 12
    3.2.1 Human Body Skeleton Detection 14
    3.2.2 Hand Skeleton Detection 14
    3.3 Skeleton-based Hand Gesture Recognition 15
    3.4 Filter 18
    3.4.1 Accumulative Moving HAR Filter (AMHF) 18
    3.4.2 Accumulative Moving DTW Filter (AMDF) 22
    3.5 Action Prediction by using Therblig Recognition 24
    CHAPTER 4. EXPERIMENTS AND RESULTS 26
    4.1 Data and Label 26
    4.1.1 Data Acquisition 26
    4.1.2 Data Labeling 30
    4.1.3 Data Balancing 32
    4.2 Simulation of Labeling Issue 34
    4.3 Implementation 37
    4.3.1 ST-GCN Configuration 37
    4.3.2 HAR Model Performance Evaluation 40
    4.3.3 Action Classification Performance Evaluation 41
    4.4 Experiments and Results 42
    4.4.1 Experiments of HAR Prediction 42
    4.4.2 Experiments of Action Classification by DTW 45
    4.5 Result Discussion 48
    CHAPTER 5. CONCLUSION 50
    5.1 Conclusion 50
    5.2 Future Work 51
    REFERENCES 53
    APPENDIX 58

    [1] W. Dai, A. Mujeeb, M. Erdt et al., "Soldering defect detection in automatic optical inspection," Advanced Engineering Informatics, vol. 43, p. 101004, 2020.
    [2] F. B. Gilbreth and R. T. Kent, Motion Study: A Method For Increasing The Efficiency Of The Workman. D. Van Nostrand Company, 1911.
    [3] M. Zeng, L. T. Nguyen, B. Yu et al., "Convolutional neural networks for human activity recognition using mobile sensors," in 6th International Conference on Mobile Computing, Applications and Services, Austin, TX, USA, Nov 6 - 7 2014: IEEE, pp. 197-205.
    [4] N. Ho, P.-M. Wong, M. Chua et al., "Virtual reality training for assembly of hybrid medical devices," Multimedia Tools and Applications, vol. 77, no. 23, pp. 30651-30682, 2018.
    [5] X. Yin, X. Fan, W. Zhu et al., "Synchronous AR assembly assistance and monitoring system based on ego-centric vision," Assembly Automation, 2019.
    [6] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, Jul 21 - 26 2017, pp. 6299-6308.
    [7] Y. Sun, E. Lank, and M. Terry, "Label-and-Learn: Visualizing the Likelihood of Machine Learning Classifier's Success During Data Labeling," in 22nd International Conference on Intelligent User Interfaces, New York, USA, Mar 13 - 16 2017, pp. 523-534.
    [8] S. Vishwakarma and A. Agrawal, "A Survey on Activity Recognition and Behavior Understanding in Video Surveillance," The Visual Computer, vol. 29, no. 10, pp. 983-1009, 2013.
    [9] R. Poppe, "A Survey on Vision-Based Human Action Recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976-990, 2010.
    [10] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild," arXiv preprint arXiv:1212.0402, 2012.
    [11] A. Shahroudy, J. Liu, T.-T. Ng et al., "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, Jun 26 - Jul 01 2016, pp. 1010-1019.
    [12] M. Fu, N. Chen, Z. Huang et al., "Human Action Recognition: A Survey," in International Conference On Signal And Information Processing, Networking And Computers, 2018: Springer, pp. 69-77.
    [13] M. Moniruzzaman, Z. Yin, Z. H. He et al., "Human Action Recognition by Discriminative Feature Pooling and Video Segmentation Attention Model," IEEE Transactions on Multimedia, 2021.
    [14] C. Li, Q. Huang, X. Li et al., "Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences," arXiv preprint arXiv:2101.07618, 2021.
    [15] C. Liu, J. Ying, H. Yang et al., "Improved human action recognition approach based on two-stream convolutional neural network model," The Visual Computer, pp. 1-15, 2020.
    [16] M. E. Kalfaoglu, S. Kalkan, and A. A. Alatan, "Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition," in European Conference on Computer Vision, Aug 23 - 28 2020: Springer, pp. 731-747.
    [17] J. Zang, L. Wang, Z. Liu et al., "Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition," in IFIP International Conference on Artificial Intelligence Applications and Innovations, Rhodes, Greece, May 25 - 27 2018: Springer, pp. 97-108.
    [18] L. Wang, Y. Xu, J. Cheng et al., "Human Action Recognition by Learning Spatio-Temporal Features With Deep Neural Networks," IEEE Access, vol. 6, pp. 17913-17922, 2018.
    [19] L. Wang, Y. Xiong, Z. Wang et al., "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition," in European Conference on Computer Vision, Amsterdam, The Netherlands, Oct 8 - 16 2016: Springer, pp. 20-36.
    [20] K. Simonyan and A. Zisserman, "Two-stream Convolutional Networks for Action Recognition in Videos," arXiv preprint arXiv:1406.2199, 2014.
    [21] S. Ji, W. Xu, M. Yang et al., "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2012.
    [22] T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," arXiv preprint arXiv:1609.02907, 2016.
    [23] S. Yan, Y. Xiong, and D. Lin, "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition," in AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, Feb 2 - 7 2018, vol. 32, no. 1.
    [24] M.-F. Tsai and C.-H. Chen, "Spatial Temporal Variation Graph Convolutional Networks (STV-GCN) for Skeleton-Based Emotional Action Recognition," IEEE Access, vol. 9, pp. 13870-13877, 2021.
    [25] C. Liu, X. Li, Q. Li et al., "Robot Recognizing Humans Intention and Interacting with Humans Based on a Multi-Task Model Combining ST-GCN-LSTM Model and YOLO Model," Neurocomputing, vol. 430, pp. 174-184, 2021.
    [26] U. Bhattacharya, T. Mittal, R. Chandra et al., "STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits," in AAAI Conference on Artificial Intelligence, New York, USA, Feb 7 - 12 2020, vol. 34, no. 02, pp. 1342-1350.
    [27] Y. Li, Z. He, X. Ye et al., "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Dynamic Hand Gesture Recognition," EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, pp. 1-7, 2019.
    [28] D. Feng, Z. Wu, J. Zhang et al., "Multi-Scale Spatial Temporal Graph Neural Network for Skeleton-Based Action Recognition," IEEE Access, vol. 9, pp. 58256-58265, 2021.
    [29] O. Keskes and R. Noumeir, "Vision-Based Fall Detection Using ST-GCN," IEEE Access, vol. 9, pp. 28224-28236, 2021.
    [30] H. Duan, Y. Zhao, K. Chen et al., "Revisiting Skeleton-based Action Recognition," arXiv preprint arXiv:2104.13586, 2021.
    [31] J. Xie, W. Xin, R. Liu et al., "Cross-Channel Graph Convolutional Networks for Skeleton-Based Action Recognition," IEEE Access, vol. 9, pp. 9055-9065, 2021.
    [32] J. Cai, N. Jiang, X. Han et al., "JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition," in IEEE/CVF Winter Conference on Applications of Computer Vision, Jan 5 – 9 2021, pp. 2735-2744.
    [33] X. Hao, J. Li, Y. Guo et al., "Hypergraph Neural Network for Skeleton-Based Action Recognition," IEEE Transactions on Image Processing, vol. 30, pp. 2263-2275, 2021.
    [34] Y. Obinata and T. Yamamoto, "Temporal Extension Module for Skeleton-Based Action Recognition," in 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, Jan 10 - 15 2021: IEEE, pp. 534-540.
    [35] K. Cheng, Y. Zhang, X. He et al., "Skeleton-Based Action Recognition With Shift Graph Convolutional Network," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 14 - 19 2020, pp. 183-192.
    [36] L. Shi, Y. Zhang, J. Cheng et al., "Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks," IEEE Transactions on Image Processing, vol. 29, pp. 9532-9545, 2020.
    [37] L. Shi, Y. Zhang, J. Cheng et al., "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, Jun 15 - 21 2019, pp. 12026-12035.
    [38] C. Si, W. Chen, W. Wang et al., "Convolutional LSTM Network for Skeleton-Based Action Recognition," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, Jun 15 - 21 2019, pp. 1227-1236.
    [39] D. Liang, G. Fan, G. Lin et al., "Three-Stream Convolutional Neural Network With Multi-Task and Ensemble Learning for 3D Action Recognition," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, USA, Jun 16 - 17 2019.
    [40] C. Reining, F. Niemann, F. Moya Rueda et al., "Human Activity Recognition for Production and Logistics—A Systematic Literature Review," Information, vol. 10, no. 8, p. 245, 2019.
    [41] Z. Cao, G. Hidalgo, T. Simon et al., "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172-186, 2019.
    [42] H.-S. Fang, S. Xie, Y.-W. Tai et al., "RMPE: Regional Multi-Person Pose Estimation," in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct 22 - 29 2017, pp. 2334-2343.
    [43] K. Kim and Y. K. Cho, "Effective Inertial Sensor Quantity and Locations on a Body for Deep Learning-Based Worker's Motion Recognition," Automation in Construction, vol. 113, p. 103126, 2020.
    [44] B. Settles, "Active Learning Literature Survey," 2009.
    [45] Z. Zhou, J. Shin, L. Zhang et al., "Fine-Tuning Convolutional Neural Networks for Biomedical Image Analysis: Actively and Incrementally," in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, Jul 21 - 26 2017, pp. 7340-7351.
    [46] P. Dube, B. Bhattacharjee, S. Huo et al., "Automatic Labeling of Data for Transfer Learning," nature, vol. 192255, 2019.
    [47] H. Gammulle, T. Fernando, S. Denman et al., "Coupled Generative Adversarial Network for Continuous Fine-Grained Action Segmentation," in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019: IEEE, pp. 200-209.
    [48] Z. Wang, Z. Gao, L. Wang et al., "Boundary-aware cascade networks for temporal action segmentation," in European Conference on Computer Vision, 2020: Springer, pp. 34-51.
    [49] G. Hidalgo, Z. Cao, T. Simon et al. "OpenPose: Real-time Multi-person Keypoint Detection Library for Body, Face, Hands, and Foot Estimation." https://github.com/CMU-Perceptual-Computing-Lab/openpose (accessed Jun 03, 2021).
    [50] Z. Cao, T. Simon, S.-E. Wei et al., "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 7291-7299.
    [51] T. Simon, H. Joo, I. Matthews et al., "Hand Keypoint Detection in Single Images using Multiview Bootstrapping," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1145-1153.
    [52] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
    [53] F. B. Gilbreth and L. M. Gilbreth, "Classifying the Elements of Work," Management and Administration, 1924.
    [54] C.-L. Yang, W.-T. Li, and S.-C. Hsu, "Skeleton-based Hand Gesture Recognition for Assembly Line Operation," in 2020 International Conference on Advanced Robotics and Intelligent Systems (ARIS), 2020: IEEE, pp. 1-6.
    [55] W. Kay, J. Carreira, K. Simonyan et al., "The Kinetics Human Action Video Dataset," arXiv preprint arXiv:1705.06950, 2017.
    [56] M. Müller, "Dynamic Time Warping," Information retrieval for music and motion, pp. 69-84, 2007.
    [57] J. Redmon, S. Divvala, R. Girshick et al., "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, Jun 27 - 30 2016, pp. 779-788.

    Full text available from 2024/08/11 (campus network)
    Full text available from 2026/08/11 (off-campus network)
    Full text available from 2026/08/11 (National Central Library: Taiwan Dissertations and Theses System)