Graduate Student: Rizard Renanda Adhi Pramono
Thesis Title: CNN-based Action Tube Detection in Videos Using Spatio-Temporal Attention Module
Advisors: Wen-Hsien Fang, Yie-Tarng Chen
Committee Members: Kuen-Tsair Lay, Chien-Ching Chiu, Ju-Hong Lee
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2018
Academic Year: 106
Language: English
Pages: 66
Keywords: action localization, action tube, long short-term memory, convolutional neural network, localization refinement
This thesis presents an effective convolutional neural network (CNN)-based
method to generate spatio-temporal action tubes for action localization in videos.
First, a sequential localization refinement (SLR) is proposed to refine the inaccurate
bounding boxes generated by two-stream CNN-based detection networks. Next, a sequence
re-scoring (SR) algorithm is employed to resolve the low detection scores caused by
occlusion. Thereafter, a new fusion strategy is invoked, which integrates not only the
appearance and motion information from the two-stream detection networks, but
also the motion saliency, which alleviates the effect of small camera motion, and the
sequential information from two-stream attention-based long short-term memory (LSTM)
networks, to provide more reliable detection scores. Furthermore, an efficient multiple
path search (MPS) algorithm is utilized to find multiple action paths simultaneously in
a single run. A median filter is also applied to smooth inconsistent path scores, which
assists the temporal trimming algorithm in handling temporal localization. Simulations
show that the proposed method in general outperforms the main state-of-the-art
works on the widely used UCF-101, J-HMDB, and UCF-Sports datasets.