
Student: 劉恩
En Liou
Thesis Title: 應用狀態空間模型於訓練資料去識別化之人體動作辨識
Utilizing State Space Model to Human Action Recognition for De-identify Training Data
Advisor: 楊朝龍
Chao-Lung Yang
Committee Members: 楊朝龍
Chao-Lung Yang
王孔政
Kung-Jeng Wang
林柏廷
Po-Ting Lin
Degree: 碩士
Master
Department: 產學創新學院 - 智慧製造科技研究所
Graduate Institute of Intelligent Manufacturing Technology
Publication Year: 2024
Graduation Academic Year: 113 (ROC calendar)
Language: English
Pages: 40
Keywords (Chinese): 人體動作辨識、姿勢估計、去識別化、匿名化、狀態空間模型
Keywords (English): Human action recognition, Pose estimation, De-identification, Anonymization, State-space model
In recent years, with the rapid development of artificial intelligence, computer vision has shown great potential in industrial manufacturing applications, especially in human action recognition (HAR). However, on real production lines, effectively protecting worker privacy while performing accurate action recognition remains a difficult challenge. To address this, this study proposes a lightweight RGB-based visual framework named Skeleton Feature Masking (SFM), which fuses raw images with skeletal information to achieve efficient and accurate action recognition while safeguarding personal privacy. First, the raw images within the region of interest (ROI) covering the worker's main actions are integrated with the human skeleton information extracted by pose estimation. Next, four de-identification strategies are used to fuse the skeletal information into the original video, removing personally identifiable data while retaining the key information needed for subsequent action recognition. Within this framework, a sliding-window method segments the data into inputs for a state-space model (SSM) that performs sequential action recognition. The SSM model is then trained and validated on videos that simulate a manufacturing site, with the goal of practical industrial deployment. Experimental results show that the proposed method raises the accuracy of recognizing operators' standard operating procedure (SOP) actions from 87.68% to 95.77% while reducing the data volume by at least 40%, significantly improving recognition performance.
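
As a concrete illustration of the masking idea described above, the following is a minimal Python sketch of one possible de-identification step: extracting body landmarks with the standard MediaPipe Pose "solutions" API and redrawing only the skeleton on a blank ROI. The ROI coordinates, the mask_frame helper, and the choice of a black background (rather than one of the thesis's four fusion strategies) are illustrative assumptions, not the exact method used in this work.

import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils
# One pose estimator reused across frames (MediaPipe legacy "solutions" API).
pose = mp_pose.Pose(static_image_mode=False, model_complexity=1)

def mask_frame(frame_bgr, roi=(0, 0, 640, 480)):
    """Return a de-identified ROI: a black canvas with only the pose skeleton drawn."""
    x, y, w, h = roi                                  # assumed ROI around the worker's main actions
    crop = frame_bgr[y:y + h, x:x + w]
    result = pose.process(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
    masked = np.zeros_like(crop)                      # discard appearance (and identity) pixels
    if result.pose_landmarks is not None:
        mp_draw.draw_landmarks(masked, result.pose_landmarks, mp_pose.POSE_CONNECTIONS)
    return masked

A full pipeline would instead apply one of the four fusion strategies (for example blending the skeleton with selected raw-image regions) before passing the frames to the downstream recognizer.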


With the rapid development of artificial intelligence, the application of computer vision in industrial manufacturing sites has shown great potential, especially in human action recognition (HAR). However, ensuring accurate HAR while protecting worker privacy on real production lines remains a difficult challenge. In this paper, a lightweight RGB-based visual framework, called Skeleton Feature Masking (SFM), is proposed. This framework aims to achieve efficient and accurate HAR while preserving privacy by integrating raw images with skeletal information. First, the raw images in the region of interest (ROI) of the main actions are integrated with the skeletal information of human workers extracted by pose estimation. Four de-identification strategies are then used to fuse the skeletal information with the original video, removing personal identification data while retaining the key information needed for action recognition. A sliding-window method segments the data into inputs for a state-space model (SSM) that performs sequential action recognition, and a series of videos recorded at the manufacturing site is used to train and validate this SSM model for practical industrial application. The experimental results show that the accuracy of recognizing operator SOP actions is improved from 87.68% to 95.77%, while the data volume is reduced by at least 40%.
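
To make the segmentation step concrete, the sketch below shows a generic sliding-window split of a (de-identified) frame sequence into fixed-length, overlapping clips that a sequence classifier such as an SSM can consume. The window length of 16, the stride of 8, and the sliding_windows and model.predict names are assumptions for illustration; the thesis does not specify them in this abstract.

import numpy as np

def sliding_windows(frames, window=16, stride=8):
    """Yield overlapping clips of shape (window, H, W, C) from a list of frames."""
    frames = np.asarray(frames)
    for start in range(0, len(frames) - window + 1, stride):
        yield frames[start:start + window]

# Usage sketch: classify each clip with a trained sequence model.
# `model` stands in for whatever SSM classifier (e.g. a VideoMamba-style model) is trained.
# for clip in sliding_windows(deidentified_frames, window=16, stride=8):
#     label = model.predict(clip)

Overlap between consecutive windows trades extra computation for denser temporal coverage, which helps when an SOP action spans window boundaries.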

摘要 ... i
ABSTRACT ... ii
TABLE OF CONTENTS ... iii
LIST OF FIGURES ... v
LIST OF TABLES ... vi
CHAPTER 1. INTRODUCTION ... 1
CHAPTER 2. LITERATURE REVIEW ... 4
2.1. Human Action Recognition (HAR) ... 4
2.2. RGB Modality ... 7
2.3. Skeleton Modality ... 8
2.4. State-space model (SSM) ... 9
2.5. De-identification and privacy protection research in computer vision ... 11
CHAPTER 3. METHODOLOGY ... 13
3.1. Research structure ... 13
3.2. Skeleton feature extraction ... 15
3.2.1. Mediapipe Hands ... 16
3.2.2. Mediapipe Pose ... 17
3.3. De-identification and fusion strategies ... 20
3.4. Video Mamba ... 22
CHAPTER 4. EXPERIMENTS AND RESULTS ... 27
4.1. Data collection and review ... 27
4.2. Implementation ... 29
4.2.1. Architecture configuration ... 29
4.2.2. Performance evaluation ... 30
4.3. Experiments and results ... 31
CHAPTER 5. CONCLUSION ... 35
REFERENCES ... 37


Full text available from 2026/11/05 (campus network)
Full text available from 2026/11/05 (off-campus network)
Full text available from 2026/11/05 (National Central Library: Taiwan NDLTD system)