
Graduate Student: 潘葦菱 (Wei-Ling Pan)
Thesis Title: 基於三維卷積網路與物件偵測作業員清潔動作解析之研究
Untrimmed Operator Cleaning Action Parsing based on 3D Convolutional Neural Network and Object Detection
Advisor: 周碩彥 (Shuo-Yan Chou)
Committee Members: 周碩彥 (Shuo-Yan Chou)
郭伯勳 (Po-Hsun Kuo)
羅士哲 (Shih-Che Lo)
Degree: Master
Department: Department of Industrial Management, College of Management
Year of Publication: 2021
Graduation Academic Year: 109 (ROC calendar)
Language: English
Number of Pages: 58
Chinese Keywords: 3D Convolutional Neural Network, Action Segmentation, Object Detection, Operator Action Parsing
English Keywords: 3D Convolutional Neural Network (3DCNN), Operator Action Parsing
    With the advancement of cutting-edge technology, the more sophisticated a product is, the higher the quality of manufacturing environment it requires; clean rooms have accordingly been adopted in semiconductor manufacturing, food processing, medical treatment, precision industry, and similar fields. Within clean-room processes, fine particles not only contaminate the environment but also lower product yield, so removing the particles attached to the body before personnel enter the clean room is critical. Over the past decade, the emergence of deep learning, the availability of large-scale action datasets, and the wide deployment of computer vision in real-world settings have enabled rapid progress in action recognition, one of the field's key tasks, prompting more manufacturers to introduce intelligent video analysis and monitoring on their premises in the hope of using manpower more efficiently and responding simply and quickly to on-site monitoring needs. Building on a 3D convolutional neural network (3DCNN) and an object-detection architecture, this study proposes two mechanisms for parsing standard cleaning actions. First, in the continuous cleaning procedure captured by an RGB camera, every sampled window of n frames is classified by the 3DCNN, and the class results are used to segment the stream into seven independent cleaning actions. Second, the YOLO object-detection method locates the dust-removal stick, and the distance between the target's center and each detection point is computed to monitor the completeness of each action individually. The goal of this thesis is to build a system that can monitor whether an operator performs the dust-removal actions correctly. The experiments show that the 3DCNN can distinguish temporal differences between actions and extract the target action frames, and that, combined with the YOLOv4 algorithm, it achieves automated monitoring of the operator's dust-removal procedure. The proposed framework can also be applied to recognizing various operating procedures in factories, effectively ensuring operational effectiveness and personnel safety, or to other scenarios that require monitoring.
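    As a concrete illustration of the first mechanism, the following is a minimal sketch, not the thesis's implementation: the 16-frame window, the toy `Tiny3DCNN` network, and the non-overlapping sampling are all illustrative assumptions standing in for the n-frame sampling and 3DCNN classifier described above.

```python
# Hedged sketch of sliding-window action segmentation with a 3DCNN.
# Window length, architecture, and shapes are illustrative assumptions.
import torch
import torch.nn as nn

N_FRAMES = 16          # frames per sampled clip (assumed window size)
NUM_CLASSES = 7        # the seven independent cleaning actions

class Tiny3DCNN(nn.Module):
    """Toy stand-in for the thesis's 3DCNN clip classifier."""
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # (B,3,T,H,W) -> (B,16,T,H,W)
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),                  # halve T, H, and W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        x = self.features(clips).flatten(1)
        return self.classifier(x)

def segment_stream(frames: torch.Tensor, model: nn.Module) -> list[int]:
    """Classify consecutive non-overlapping N_FRAMES windows of an
    untrimmed (T, 3, H, W) frame tensor; one label per window."""
    model.eval()
    labels = []
    with torch.no_grad():
        for start in range(0, frames.shape[0] - N_FRAMES + 1, N_FRAMES):
            clip = frames[start:start + N_FRAMES]          # (T, 3, H, W)
            clip = clip.permute(1, 0, 2, 3).unsqueeze(0)   # (1, 3, T, H, W)
            labels.append(model(clip).argmax(dim=1).item())
    return labels

if __name__ == "__main__":
    video = torch.rand(64, 3, 112, 112)   # dummy untrimmed RGB stream
    print(segment_stream(video, Tiny3DCNN()))
```

    Runs of identical window labels then delimit the seven independent cleaning actions within the untrimmed stream.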


    With the advancement of cutting-edge technology, the more sophisticated a product is, the higher the quality of the manufacturing environment it requires, as in semiconductor manufacturing, medical treatment, precision industry, and similar fields. In clean-room processes, small particles not only pollute the environment but also lower product yield, so it is essential to remove the particles attached to the body before entering the clean room. In recent years, more and more companies have introduced intelligent monitoring into their factories, expecting to improve the efficiency of labor utilization and to simplify and speed up responses to field-monitoring requirements; this has made computer vision technology widely used on the factory floor. Human action parsing is one of the important tasks in computer vision. It is closely related to action recognition for video understanding, a topic that has attracted growing research interest over the last decade thanks to the emergence of deep learning, the availability of large-scale datasets, and its wide range of real-world applications. This thesis describes an approach for parsing untrimmed standard cleaning actions from an RGB camera, based on a 3D convolutional neural network (3DCNN) and object detection (YOLO). We propose two mechanisms built on operator standard-cleaning-action parsing: one performs action segmentation with an n-frame 3DCNN classifier, and the other verifies action completion with an object detector. To effectively remove the particles attached to the body, this work takes the standard self-cleaning procedure as an example and monitors whether every worker performs the seven self-cleaning actions correctly.
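    The second mechanism reduces to a distance test between the detected dust stick and a set of body checkpoints. The sketch below is a hedged illustration: the `checkpoints` coordinates, the 40-pixel threshold, and the helper names are invented for this example, while the thesis derives its own detection points and completeness criterion.

```python
# Hedged sketch of the action-completion check: mark a checkpoint as
# covered when the dust stick's detected box center comes close enough.
import math

def box_center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    """Center of a (x_min, y_min, x_max, y_max) box, as a YOLO-style
    detector might return it."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def update_completion(detected_box, checkpoints, visited, threshold=40.0):
    """Mark every checkpoint within `threshold` pixels of the dust stick's
    center as visited; the action counts as complete once all are."""
    cx, cy = box_center(detected_box)
    for name, (px, py) in checkpoints.items():
        if math.hypot(cx - px, cy - py) <= threshold:
            visited.add(name)
    return len(visited) == len(checkpoints)

# Usage: feed one detection per frame; stop once the action is complete.
checkpoints = {"left_shoulder": (120, 80), "right_shoulder": (200, 80)}  # assumed
visited: set[str] = set()
for frame_box in [(100, 60, 150, 110), (180, 55, 230, 105)]:  # dummy detections
    if update_completion(frame_box, checkpoints, visited):
        print("cleaning action fully covered")
```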

    Abstract (Chinese)
    ABSTRACT
    ACKNOWLEDGEMENT
    CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    LIST OF EQUATIONS
    Chapter 1 Introduction
        1.1 Background and Motivation
        1.2 Challenges and Issues of Vision-based Activity Recognition
        1.3 Research Objective and Contributions
        1.4 Organization of the Research
    Chapter 2 Literature Review
        2.1 Vision-based Human Action Recognition
            2.1.1 Frame Fusion and Two Stream
            2.1.2 ConvLSTM
            2.1.3 3D ConvNet
        2.2 Temporal Action Detection and Localization
        2.3 Object Detection
        2.4 Human Parsing and Keypoints of Human Body
    Chapter 3 Research Methodology
        3.1 Data Collection
        3.2 Untrimmed Video Action Detection
            3.2.1 Data Preprocessing
            3.2.2 Neural Network Modeling
            3.2.3 Action Detection
        3.3 Object Detection
            3.3.1 Dataset Preparing
            3.3.2 Important Parameters of the Regression Bounding Box
            3.3.3 YOLOv4 Algorithm
        3.4 Action Completion Mechanism
    Chapter 4 Implementation
        4.1 Hardware and Software Configuration
        4.2 Action Detection
            4.2.1 Dataset Description
            4.2.2 Classifier Model Training
            4.2.3 Experimental Results
        4.3 Dust Stick Detection
            4.3.1 Dataset Description
            4.3.2 Creating a Relevant Folder Structure in YOLOv4 Format
            4.3.3 Detector Training
            4.3.4 Experimental Results
        4.4 Action Completion Mechanism
    Chapter 5 Conclusion and Future Research
        5.1 Conclusion
        5.2 Limitations
        5.3 Future Research
    REFERENCES


    Full text available from 2026/06/10 (campus network)
    Full text available from 2026/06/10 (off-campus network)
    Full text available from 2026/06/10 (National Central Library: Taiwan Thesis and Dissertation System)