Graduate Student: KAI-WEI CHUANG (莊鎧蔚)
Thesis Title: Action Tube Detection Using Graph Convolution Network with Spatio-temporal Self-attention Module (Chinese title: 應用圖像捲積神經網路之空間與時間動作偵測模型)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Oral Defense Committee: Kuen-Tsair Lay (賴坤財), Sheng-Luen Chung (鍾聖倫), Chien-Ching Chiu (丘建青)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 68
Chinese Keywords (translated): action localization, convolutional neural network, graph convolutional network, self-attention mechanism
English Keywords: action localization, convolutional neural network, graph convolution, self-attention
Access Counts: Views: 199, Downloads: 0

This thesis proposes a video action localization framework based on graph convolutional networks (GCN). To produce more accurate action localization, a 3D convolutional neural network (3D-CNN) is first employed for preliminary action detection. An action detection network composed of bidirectional self-attention and graph convolution layers then models the temporal and spatial relationships among actors to refine the initial detections. To this end, the bidirectional self-attention learns the feature similarity between neighboring actors, from which the nodes and edges of a graph are constructed; this graph is then fed into the graph convolutional network. With this architecture, the graph convolution strengthens the structural relationships between each actor and its neighboring objects, while the self-attention mechanism captures non-local dependencies, together yielding more accurate action localization. Unlike existing deep networks on graphs, our method builds the graph from spatial-temporal information and actor interactions, without requiring a predefined graph as an additional input to the GCN. Experimental results demonstrate the effectiveness of the method on the publicly available UCF-24, J-HMDB, and AVA datasets.
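To make the graph-construction step above concrete, the following minimal sketch (written in PyTorch; the class name, projection layers, and dimensions are illustrative assumptions rather than the thesis implementation, and the temporal/bidirectional part of the attention is omitted) shows how pairwise feature similarity between detected actors and objects could be turned into the soft adjacency matrix of a graph:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGraphBuilder(nn.Module):
    # Hypothetical sketch: turn N actor/object feature vectors into an N x N soft adjacency.
    def __init__(self, feat_dim: int, key_dim: int = 64):
        super().__init__()
        self.query = nn.Linear(feat_dim, key_dim)
        self.key = nn.Linear(feat_dim, key_dim)
        self.scale = key_dim ** 0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) pooled features of the detected actors/objects (graph nodes)
        q, k = self.query(feats), self.key(feats)
        sim = q @ k.t() / self.scale      # pairwise similarity serves as edge weights
        return F.softmax(sim, dim=-1)     # row-normalized soft adjacency, shape (N, N)

Each row of the returned matrix can be read as the edges from one node to all the others, i.e. the learned graph on which the graph convolutional network then operates.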


In this thesis, we propose an effective graph-based approach for action localization. The new approach first uses a 3D convolutional neural network (3D-CNN) for preliminary action detection. Subsequently, a new Hierarchical self-attention with Graph convolution Network (HiGN), consisting of multi-level bidirectional self-attention and graph convolution layers, is devised to refine the action detection based on the temporal and spatial relationships of the actors. Toward this end, new spatial-temporal graphs, in which each node represents either an actor or an object, are constructed based on the feature similarity learned by the bidirectional self-attention. As a result, the established architecture can harness the power of graph modelling to capture the relationships between the actors and their neighboring objects, together with the strength of self-attention in learning non-local dependencies, leading to more accurate action localization. In contrast to existing deep networks on graphs, HiGN can model the actors' interactions based on the spatial-temporal graphs without predefined rules. Simulations showcase the efficacy of the new network on three commonly used datasets: UCF-24, J-HMDB, and AVA.
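As an illustration of how such a graph could be used to refine the preliminary detections, the sketch below applies a single graph convolution layer with the symmetric normalization of Kipf and Welling to actor features and produces refined per-actor action scores. The class name, single-layer structure, and dimensions are assumptions for illustration, not the exact HiGN architecture:

import torch
import torch.nn as nn

class GCNRefiner(nn.Module):
    # Hypothetical sketch: one GCN layer that refines actor features into action scores.
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.weight = nn.Linear(feat_dim, feat_dim)        # graph convolution weight W
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) actor features; adjacency: (N, N) soft graph, e.g. from self-attention
        a_hat = adjacency + torch.eye(adjacency.size(0), device=adjacency.device)  # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=-1).pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt           # D^{-1/2} (A + I) D^{-1/2}
        refined = torch.relu(a_norm @ self.weight(feats))  # one round of message passing
        return self.classifier(refined)                    # refined scores, shape (N, num_classes)

# Example usage with random features standing in for pooled 3D-CNN detections:
feats = torch.randn(5, 256)                                 # 5 detected actors, 256-d features
adjacency = torch.softmax(feats @ feats.t(), dim=-1)        # stand-in similarity-based graph
scores = GCNRefiner(256, num_classes=24)(feats, adjacency)  # e.g. the 24 UCF-24 action classes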

摘要 (Chinese Abstract)
Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related Work
2.1 Object detection
2.2 Action recognition
2.3 Attention Mechanism
2.4 Graph Neural Networks
2.5 Summary
3 Proposed Method
3.1 Overall Methodology
3.2 Action Detection Network
3.3 Hierarchical Self-Attention with Graph Convolution Network
3.3.1 Bidirectional Self-Attention Unit
3.3.2 Graph Convolutional Networks
3.4 Tube Association Network
3.5 Loss Function
3.6 Summary
4 Experimental Results
4.1 Datasets
4.1.1 UCF-24
4.1.2 J-HMDB
4.1.3 AVA
4.2 Evaluation Metrics
4.3 Experimental Setup
4.4 Ablation Studies
4.4.1 Impact of different prediction strategies
4.4.2 Impact of the GCN layer
4.5 Comparisons with State-of-the-Art Methods
4.6 Successful Cases and Error Analysis
4.6.1 UCF-24 Dataset
4.6.2 J-HMDB Dataset
4.6.3 AVA Dataset
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Appendix A: Example images from the datasets
References


Full-text release date: 2024/09/15 (campus network)
Full-text release date: 2031/09/15 (off-campus network)
Full-text release date: 2031/09/15 (National Central Library: Taiwan theses and dissertations system)