
Student: Rizard Renanda Adhi Pramono
Thesis Title: CNN-based Action Tube Detection in Videos Using Spatio-Temporal Attention Module
Advisors: Wen-Hsien Fang, Yie-Tarng Chen
Committee Members: Kuen-Tsair Lay, Chien-Ching Chiu, Ju-Hong Lee
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 66
Keywords: action localization, action tube, long short-term memory, convolutional neural network, localization refinement
Abstract: This thesis presents an effective convolutional neural network (CNN)-based method that generates spatio-temporal action tubes for action localization in videos. First, a sequential localization refinement (SLR) is proposed to refine the inaccurate bounding boxes produced by two-stream CNN-based detection networks. Next, a sequence re-scoring (SR) algorithm is employed to raise detection scores that are lowered by occlusion. Thereafter, a new fusion strategy is invoked that integrates not only the appearance and motion information from the two-stream detection networks, but also the motion saliency, which alleviates the effect of small camera motion, and the sequential information from two-stream attention-based long short-term memory (LSTM) networks, to provide more reliable detection scores. Furthermore, an efficient multiple path search (MPS) algorithm is utilized to find multiple paths simultaneously in a single run. A median filter is also applied to reduce inconsistent path scores and thereby assist the temporal trimming algorithm with temporal localization. Simulations show that the proposed method in general outperforms the main state-of-the-art works on the widely used UCF-101, J-HMDB, and UCF-Sports datasets.
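
As a rough illustration of the last step summarized above (median filtering of path scores before temporal trimming), the sketch below is a minimal Python example, not the thesis's actual implementation: the function name trim_action_path, the threshold value, and the use of SciPy's medfilt are assumptions made here for clarity. It smooths the per-frame detection scores of one candidate path and then keeps the longest contiguous run of frames whose smoothed score exceeds a threshold.

    import numpy as np
    from scipy.signal import medfilt   # 1-D median filter over the score sequence

    def trim_action_path(frame_scores, kernel_size=5, score_threshold=0.5):
        """Hypothetical sketch: smooth per-frame path scores with a median
        filter, then keep the longest contiguous run of frames whose smoothed
        score stays above a threshold (the temporal trimming step)."""
        scores = np.asarray(frame_scores, dtype=np.float64)
        smoothed = medfilt(scores, kernel_size=kernel_size)   # kernel_size must be odd
        above = smoothed >= score_threshold

        best_start, best_end = 0, -1        # empty interval by default
        start = None
        for t, flag in enumerate(above):
            if flag and start is None:
                start = t                   # a candidate segment begins
            if not flag and start is not None:
                if (t - 1) - start > best_end - best_start:
                    best_start, best_end = start, t - 1
                start = None
        # Handle a segment that runs through the last frame.
        if start is not None and (len(above) - 1) - start > best_end - best_start:
            best_start, best_end = start, len(above) - 1
        return best_start, best_end, smoothed

    # Example: scores dip briefly in the middle of the action; the median
    # filter keeps the dip from splitting the detected interval.
    scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.85, 0.9, 0.7, 0.2, 0.1]
    start, end, _ = trim_action_path(scores, kernel_size=3, score_threshold=0.5)
    print(f"action spans frames {start}..{end}")   # frames 2..7 for this toy input

In this sketch the median filter suppresses isolated dips or spikes in the per-frame scores, so that a single noisy frame neither splits nor artificially extends the detected action interval.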

Table of Contents:
Abstract
Acknowledgment
Table of contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related work
3 Proposed Approach
  3.1 Overall Methodology
  3.2 Action Detection Network
  3.3 Sequential Localization Refinement
  3.4 Sequence Re-scoring
  3.5 Fusion Strategy
    3.5.1 Attention-based Long Short-term Memory
    3.5.2 Motion Saliency
    3.5.3 Boost Fusion
    3.5.4 Union Fusion
  3.6 Multiple Path Search
  3.7 Temporal Trimming with Median Filter
4 Experimental Results
  4.1 Datasets
  4.2 Evaluation Protocol and Experimental Setup
  4.3 Performance Evaluation
    4.3.1 Impact of the Motion Saliency
    4.3.2 Impact of the Two-stream Attention-based LSTM
    4.3.3 Impact of SLR
    4.3.4 Impact of SR
    4.3.5 Impact of Median Filters
    4.3.6 Impact of Fusion Strategies
  4.4 Comparisons with State-of-the-Art Methods
5 Conclusions and Future Works
  5.1 Conclusions
  5.2 Future Works
Appendix A: Example images from the datasets
Appendix B: Some spatio-temporal localization results on untrimmed UCF-101 videos
Appendix C: Visualization of the attention weights generated by the attention module
References

