Graduate student: Erick Hendra Putra Alwando
Thesis title: Efficient Multiple Path Search for Action Tube Detection in Videos (高效多路徑搜尋之影片動作偵測)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Oral examination committee: Chien-Ching Chiu (丘建青), Kuen-Tsair Lay (賴坤財)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of publication: 2017
Academic year of graduation: 105
Language: English
Number of pages: 49
Keywords: action localization, convolutional neural networks (CNN), multiple path search, localization refinement, object detection
    This thesis presents an efficient convolutional neural network (CNN)-based approach for detecting multiple spatio-temporal action tubes in videos. First, a new fusion strategy combines the appearance and flow information from two-stream CNN-based networks with motion saliency to generate the action detection scores. Next, an efficient multiple path search (MPS) algorithm is developed to find multiple paths simultaneously in a single run. In the forward message passing of MPS, each node stores information on a prescribed number of paths, based on the accumulated scores determined in the previous stages. A backward path tracing is then invoked to find all of the paths at the same time, fully reusing the information generated in the forward pass without repeating the search process, which reduces the computational complexity. Moreover, to rectify potentially inaccurate bounding boxes, a video localization refinement (VLR) scheme is introduced to further boost the detection accuracy. Experiments show that the proposed MPS outperforms the main state-of-the-art methods on the widely used UCF-101 and J-HMDB datasets, and that combining it with VLR improves performance further.
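The forward and backward passes described in the abstract resemble a K-best (list) Viterbi search over a frame-by-frame trellis. The following is a minimal sketch of that general idea only, not the thesis's actual MPS implementation. The function name `k_best_paths`, the `edge_score` callback, and the additive scoring rule are all assumptions made for illustration.

```python
import heapq
from typing import Callable, List, Tuple


def k_best_paths(node_scores: List[List[float]],
                 edge_score: Callable[[int, int, int], float],
                 k: int) -> List[Tuple[float, List[int]]]:
    """Return up to k highest-scoring paths through a trellis of detections.

    node_scores[t][i] is the detection score of box i in frame t.
    edge_score(t, i, j) is a hypothetical linking score between box i in
    frame t-1 and box j in frame t (for example, a box-overlap term).
    """
    # Forward pass: table[t][j] keeps up to k entries
    # (accumulated_score, previous_box, previous_rank), best first.
    table = [[[(s, -1, -1)] for s in node_scores[0]]]
    for t in range(1, len(node_scores)):
        layer = []
        for j, s in enumerate(node_scores[t]):
            candidates = [
                (acc + edge_score(t, i, j) + s, i, r)
                for i, entries in enumerate(table[t - 1])
                for r, (acc, _, _) in enumerate(entries)
            ]
            layer.append(heapq.nlargest(k, candidates))
        table.append(layer)

    # Backward tracing: choose the k best end states, then follow the
    # stored pointers. No part of the search is repeated.
    last = len(table) - 1
    ends = heapq.nlargest(k, [
        (acc, j, r)
        for j, entries in enumerate(table[last])
        for r, (acc, _, _) in enumerate(entries)
    ])
    paths = []
    for acc, j, r in ends:
        path, t = [], last
        while t >= 0:
            path.append(j)
            _, j, r = table[t][j][r]  # step to the stored predecessor
            t -= 1
        paths.append((acc, path[::-1]))
    return paths
```

For example, with `node_scores = [[5, 1], [1, 4], [3, 0]]`, a linking score of 2 when the box index stays the same and 0 otherwise, and `k = 3`, the sketch returns the paths `[0, 0, 0]`, `[0, 1, 0]`, and `[0, 1, 1]` with scores 13, 12, and 11. Because every node keeps only its k best partial paths, one forward sweep is enough to recover all k final paths.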



    Abstract
    Related Publications
    Acknowledgment
    Table of contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related work
    3 Methods
        3.1 Overall Methodology
        3.2 CNNs-based Action Classifiers
        3.3 Video Localization Refinement
        3.4 Fusion Strategy
        3.5 Multiple Path Search Algorithm
    4 Experimental Results
        4.1 Datasets
        4.2 Evaluation Protocol and Experimental Setup
        4.3 Performance Evaluation
            4.3.1 The New Fusion Strategy
            4.3.2 Impact of K Parameter
        4.4 Comparisons with the State-of-the-Art Methods
        4.5 Computation Time Analysis
    5 Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    Appendix A: Example images from the datasets
    References

