
Author: Erick Hendra Putra Alwando
Title: Efficient Multiple Path Search for Action Tube Detection in Videos
Advisors: Wen-Hsien Fang, Yie-Tarng Chen
Committee: Chien-Ching Chiu, Kuen-Tsair Lay
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 105
Language: English
Pages: 49
Keywords: action localization, convolutional neural networks (CNN), multiple path search, localization refinement, object detection

This thesis presents an efficient convolutional neural network (CNN)-based approach to detect multiple spatio-temporal action tubes in videos. First, a new fusion strategy is employed, which combines the appearance and flow information from the two-stream CNN-based networks, along with motion saliency, to generate the action detection scores. Thereafter, an efficient multiple path search (MPS) algorithm is developed to find multiple paths simultaneously in a single run. In the forward message passing of MPS, each node stores the information of a prescribed number of paths based on the accumulated scores determined in the previous stages. A backward path tracing is then invoked to recover all of the paths at once by fully reusing the information generated in the forward pass, without repeating the search process; the incurred complexity is thereby reduced. Moreover, to rectify potentially inaccurate bounding boxes, a video localization refinement (VLR) scheme is also proposed to further boost the detection accuracy. Simulations show that the proposed MPS provides superior performance compared with the main state-of-the-art works on the widely used UCF-101 and J-HMDB datasets. Together with VLR, the performance of MPS can be further improved.
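The forward/backward procedure described in the abstract can be viewed as a Viterbi-like dynamic program in which every node keeps its K best accumulated scores, so that a single backward trace recovers all K paths without re-running the search. The following is a minimal illustrative sketch, not the thesis's exact formulation: the per-frame detection scores `scores`, the pairwise linking function `link`, and the parameter `K` (the prescribed number of paths kept per node) are all assumed interfaces introduced here for illustration.

```python
import heapq

def multiple_path_search(scores, link, K=2):
    """Sketch of a Viterbi-like multiple path search (MPS).

    scores[t][i]  : detection score of box i in frame t
    link(t, i, j) : linking score between box i in frame t and box j in frame t+1
    Forward pass: each node keeps its K best (accumulated score,
    predecessor box, predecessor rank) entries. Backward pass: follow
    the stored predecessors to recover the K best paths in one trace.
    """
    T = len(scores)
    # Frame 0: every box starts one path with no predecessor.
    acc = [[[(s, None, None)] for s in scores[0]]]
    for t in range(1, T):
        frame = []
        for j, sj in enumerate(scores[t]):
            # Candidate extensions from every kept path of every previous box.
            cands = [(a + link(t - 1, i, j) + sj, i, r)
                     for i, paths in enumerate(acc[t - 1])
                     for r, (a, _, _) in enumerate(paths)]
            frame.append(heapq.nlargest(K, cands))  # keep K best per node
        acc.append(frame)
    # K best terminal entries over all boxes of the last frame.
    ends = heapq.nlargest(K, [(a, i, r)
                              for i, paths in enumerate(acc[-1])
                              for r, (a, _, _) in enumerate(paths)])
    tubes = []
    for a, i, r in ends:
        # Backward tracing: reuse the forward information, no re-search.
        path, t = [], T - 1
        while i is not None:
            path.append(i)
            _, i, r = acc[t][i][r]
            t -= 1
        tubes.append((a, path[::-1]))  # (score, box index per frame)
    return tubes
```

Because every extension reuses the per-node top-K entries from the previous frame, the K tubes come out of one forward sweep and one backward trace, which is the complexity saving the abstract refers to.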



Table of Contents:
Abstract
Related Publications
Acknowledgment
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
3 Methods
  3.1 Overall Methodology
  3.2 CNN-based Action Classifiers
  3.3 Video Localization Refinement
  3.4 Fusion Strategy
  3.5 Multiple Path Search Algorithm
4 Experimental Results
  4.1 Datasets
  4.2 Evaluation Protocol and Experimental Setup
  4.3 Performance Evaluation
    4.3.1 The New Fusion Strategy
    4.3.2 Impact of K Parameter
  4.4 Comparisons with the State-of-the-Art Methods
  4.5 Computation Time Analysis
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Appendix A: Example images from the datasets
References

