
Graduate Student: Rizard Renanda Adhi Pramono
Thesis Title: CNN-based Action Tube Detection in Videos Using Spatio-Temporal Attention Module
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青), Ju-Hong Lee (李枝宏)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of Publication: 2018
Graduation Academic Year: 106 (ROC calendar)
Language: English
Number of Pages: 66
Keywords: action localization, action tube, long short-term memory, convolutional neural network, localization refinement
Access Statistics: 270 views, 1 download

This thesis presents an effective convolutional neural network (CNN)-based method that generates spatio-temporal action tubes for action localization in videos. First, a sequential localization refinement (SLR) is introduced to refine the inaccurate bounding boxes produced by the two-stream CNN-based detection networks. Next, a sequence re-scoring (SR) algorithm is employed to resolve the low detection scores caused by occlusion. Thereafter, a new fusion strategy is invoked to provide more reliable detection scores: it integrates not only the appearance and motion information from the two-stream detection networks, but also the motion saliency, which alleviates the effect of small camera motion, and the sequential information from two-stream attention-based long short-term memory (LSTM) networks. Furthermore, an efficient multiple path search (MPS) algorithm is utilized to find multiple paths simultaneously in a single run. A median filter is also applied to smooth inconsistent path scores and thereby assist the temporal trimming algorithm in the temporal localization. Simulations show that the proposed method in general outperforms the main state-of-the-art works on the widely used UCF-101, J-HMDB, and UCF-Sports datasets.
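
The final trimming step can be made concrete with a minimal sketch. The assumptions here are illustrative rather than taken from the thesis: a linked action path is summarized by a 1-D array of per-frame detection scores, the smoothing uses SciPy's median filter, and the function name trim_action_path, the window size, and the threshold are hypothetical choices.

import numpy as np
from scipy.signal import medfilt

def trim_action_path(frame_scores, window=9, threshold=0.5):
    """Median-filter noisy per-frame scores, then keep the longest run above threshold."""
    smoothed = medfilt(np.asarray(frame_scores, dtype=float), kernel_size=window)
    keep = smoothed >= threshold

    # Locate the longest consecutive run of frames that survive the threshold;
    # that run is taken as the temporal extent of the action tube.
    best_start, best_len, start = 0, 0, None
    for i, flag in enumerate(np.append(keep, False)):  # trailing False closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return (best_start, best_start + best_len) if best_len else None

# Example: an untrimmed path whose middle frames contain the action.
scores = [0.1, 0.2, 0.15, 0.7, 0.8, 0.75, 0.9, 0.85, 0.2, 0.1]
print(trim_action_path(scores, window=3, threshold=0.5))  # -> (3, 8), i.e. frames 3..7

In this toy example, frames 3 through 7 survive the smoothed threshold, so the path is trimmed to that interval; the thesis's actual trimming algorithm and its parameters may differ.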



Abstract
Acknowledgment
Table of contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related work
3 Proposed Approach
  3.1 Overall Methodology
  3.2 Action Detection Network
  3.3 Sequential Localization Refinement
  3.4 Sequence Re-scoring
  3.5 Fusion Strategy
    3.5.1 Attention-based Long Short-term Memory
    3.5.2 Motion Saliency
    3.5.3 Boost Fusion
    3.5.4 Union Fusion
  3.6 Multiple Path Search
  3.7 Temporal Trimming with Median Filter
4 Experimental Results
  4.1 Datasets
  4.2 Evaluation Protocol and Experimental Setup
  4.3 Performance Evaluation
    4.3.1 Impact of the Motion Saliency
    4.3.2 Impact of the Two-stream Attention-based LSTM
    4.3.3 Impact of SLR
    4.3.4 Impact of SR
    4.3.5 Impact of Median Filters
    4.3.6 Impact of Fusion Strategies
  4.4 Comparisons with State-of-the-Art Methods
5 Conclusions and Future Works
  5.1 Conclusions
  5.2 Future Works
Appendix A: Example images from the datasets
Appendix B: Some spatio-temporal localization results on untrimmed UCF-101 videos
Appendix C: Visualization of the attention weights generated by the attention module
References

