
Graduate Student: Shen-Chieh Hsu (許勝傑)
Thesis Title: Study of Action Recognition System Based on Deep Learning Neural Networks
Advisor: Cheng-Hsiung Yang (楊振雄)
Committee Members: 吳常熙, 陳金聖, 郭永麟
Degree: Master
Department: College of Engineering, Graduate Institute of Automation and Control
Publication Year: 2020
Academic Year of Graduation: 108
Language: English
Number of Pages: 92
Chinese Keywords: deep learning, Horn-Schunck optical flow algorithm, GoogLeNet neural network, Caffe framework, action recognition
Foreign Keywords: Deep learning, Horn-Schunck optical flow algorithm, GoogLeNet neural network, Caffe architecture, human action recognition
Deep learning has advanced rapidly as interconnected devices and systems generate large amounts of data and the parallel computing power of computers continues to grow. Recognition performance is a key indicator for practical applications, and this thesis improves the performance of video-based action recognition to address the problem of low prediction ability.
    This thesis uses the UCF-101 action recognition dataset as the basis for method verification. For image feature extraction, the GoogLeNet neural network architecture extracts spatial features and the Horn-Schunck optical flow method extracts temporal features; both are fed into a bidirectional two-layer LSTM architecture for training and verification. The method is first tested on a small subset (11 classes) of the action recognition dataset and then verified on the complete dataset (101 classes). Experimental results show that the method achieves an accuracy of 90.26539%, a precision of 91.07881%, a recall of 99.00992%, and an F1 score of 94.87891% on the UCF-101 action recognition dataset.
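    For reference, the four reported figures follow the standard confusion-matrix definitions of accuracy, precision, recall, and F1 score. The sketch below is not taken from the thesis; how the counts are aggregated over the 101 classes (micro- versus macro-averaging) is not stated in the abstract, so the single-set-of-counts form and the example numbers are assumptions used only to illustrate the formulas.

```python
# Minimal sketch (not the thesis code): how accuracy, precision, recall,
# and F1 score are obtained from confusion-matrix counts. The aggregation
# scheme over the 101 UCF-101 classes is an assumption.

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, chosen only to produce values in the same range as the
# reported results; they are not the thesis's actual counts.
acc, prec, rec, f1 = classification_metrics(tp=900, fp=88, fn=9, tn=3)
print(f"accuracy={acc:.4%}, precision={prec:.4%}, recall={rec:.4%}, F1={f1:.4%}")
```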


    The rapid development of deep learning technology is driven by the large amounts of data generated by connected device systems and by the fast-growing parallel computing capability of modern hardware. In practical applications, prediction ability is an important indicator of recognition performance. This thesis improves image-based action recognition performance and addresses the problem of low prediction ability.
    This thesis uses the UCF-101 action recognition dataset as the basis for method verification. Spatial features are extracted with the GoogLeNet neural network architecture and temporal features with the Horn-Schunck optical flow method; both are fed into a bidirectional, multi-layer LSTM model for training and verification. The method is first tested on a small subset (11 classes) of the action recognition dataset and then verified on the complete dataset (101 classes). Experimental results show that, on the UCF-101 action recognition dataset, the method achieves an accuracy of 90.26539%, a precision of 91.07881%, a recall of 99.00992%, and an F1 score of 94.87891%.
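    To make the classifier stage concrete, the following is a minimal sketch rather than the thesis implementation: the thesis extracts features with the Caffe framework, while this sketch uses PyTorch purely for illustration, and the fused feature dimension, hidden size, number of frames, and last-time-step readout are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the recurrent classifier stage only. Per-frame spatial
# features (GoogLeNet) and temporal features (Horn-Schunck optical flow) are
# assumed to be pre-extracted and concatenated into one vector per frame;
# the dimensions below are illustrative, not the thesis's actual sizes.
class BiLSTMActionClassifier(nn.Module):
    def __init__(self, feature_dim=1024 + 512, hidden_dim=256,
                 num_layers=2, num_classes=101):
        super().__init__()
        # Two stacked LSTM layers, run in both temporal directions.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence the factor of 2.
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim)
        outputs, _ = self.lstm(frame_features)
        # Classify from the representation at the last time step.
        return self.fc(outputs[:, -1, :])

# Example: a batch of 4 clips, 16 frames each, with an assumed 1536-d fused feature.
model = BiLSTMActionClassifier()
logits = model(torch.randn(4, 16, 1024 + 512))
print(logits.shape)  # torch.Size([4, 101])
```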

    Chinese Abstract I
    Abstract II
    Acknowledgements III
    Contents IV
    List of Figures VI
    List of Tables VIII
    Chapter 1 Introduction 1
    1.1 Introduction 1
    1.2 Literature Review 2
    1.3 Research Motivation and Purpose 4
    1.4 Outline 4
    Chapter 2 Feature Extraction Method Implementation 7
    2.1 Caffe Framework 8
    2.1.1 Blobs 10
    2.1.2 Layers 11
    2.1.3 Net 12
    2.2 Feature Extraction Method 13
    2.2.1 AlexNet Convolution Neural Network 13
    2.2.2 GoogLeNet Inception Architecture 20
    2.2.3 Horn-Schunck Optical Flow Algorithm 28
    2.3 Action Recognition Dataset 32
    2.3.1 UCF-101 Dataset 33
    Chapter 3 Action Recognition System Based on Deep Learning 35
    3.1 Convolutional Neural Network 36
    3.1.1 Convolution Layer 37
    3.1.2 Pooling Layer 38
    3.1.3 Fully Connected Layer 39
    3.1.4 Loss Layer 40
    3.2 Dual Direction LSTM Model 41
    3.2.1 Forward and Backward Propagation 47
    3.3 Optimizer and Loss Function 48
    3.3.1 Adam Optimizer 49
    3.3.2 Softmax Cross Entropy 51
    Chapter 4 Experimental Results and Analysis 53
    4.1 Experimental Environment 53
    4.2 Convolution Neural Network 54
    4.3 Inception Architecture 58
    4.4 Optical Flow Algorithm 61
    4.5 Feature Integration 63
    4.5.1 Model Testing 63
    4.5.2 Confusion Matrix 66
    4.5.3 Numerous Datasets Verification 68
    Chapter 5 Conclusions and Future Work 74
    5.1 Conclusions 74
    5.2 Future Work 75
    References 76


    Full-text release date: 2025/07/20 (campus network)
    Full text not authorized for release (off-campus network)
    Full text not authorized for release (National Central Library: Taiwan NDLTD system)