
Student: Yao-Bang Huang (黃耀邦)
Thesis Title: Graph Involutional Networks with Dynamic Feature Fusion for Skeleton-Based Action Recognition (動態特徵融合之圖內捲神經網路應用於人體骨架行為辨識)
Advisor: Kai-Lung Hua (花凱龍)
Committee Members: Kuo-Liang Chung (鍾國亮), Yung-Yao Chen (陳永耀), Yi-Ling Chen (陳怡伶)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2022
Graduation Academic Year: 110 (2021-2022)
Language: English
Number of Pages: 45
Keywords: Graph involution networks, Dynamic feature fusion, Skeleton-based action recognition
Abstract in Chinese (translated): In the field of skeleton-based action recognition, graph convolution has been the fundamental operator of contemporary network architectures. In previous approaches, graph convolution focuses on aggregating local features and learning discriminative motion patterns from semantically different feature representations. This local processing, however, is not necessarily effective when the relevant dependencies span distant joints (for example, the two hands in a clapping action). Moreover, the different semantic representations are usually combined through very direct operations, which can become a bottleneck for network performance. To alleviate these issues, we propose a novel Graph Involution (GI) operator and a Dynamic Feature Fusion (DFF) module. The GI operator captures richer joint dependencies, while the DFF module enlarges the network's receptive field and adaptively fuses the different semantic representations. Using the GI operator and the DFF module, we build an effective feature extractor, DFF-GIN, which achieves competitive results on two benchmark datasets in this field (NTU RGB+D 60 and NTU RGB+D 120).


Abstract in English: Graph convolution has been the fundamental operator of contemporary network architectures for skeleton-based action recognition. In previous approaches, graph convolution focuses on aggregating local features and learning discriminative motion patterns from different semantic representations. However, such local processing is inefficient at capturing long-range dependencies between distant joints, such as the two hands in a clapping action. Moreover, the different semantic representations are usually combined through direct operations, which can become a bottleneck for network performance. To alleviate these issues, we propose a novel Graph Involution (GI) operator to capture richer dependencies and a Dynamic Feature Fusion (DFF) module to enlarge the receptive field and adaptively fuse the different semantic representations. We leverage the GI operator and the DFF module to construct an effective feature extractor, DFF-GIN, which achieves competitive results on two benchmark datasets for skeleton-based action recognition (NTU RGB+D 60 and NTU RGB+D 120).
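Since only the abstract is available here, the following is a minimal PyTorch sketch of the two generic building blocks the abstract refers to: a plain spatial graph convolution over a fixed skeleton adjacency (the local aggregation that the GI operator is meant to improve on) and a squeeze-and-excitation-style gate as a stand-in for "adaptive fusion" of two semantic branches. All class names, tensor shapes, and the gating design are illustrative assumptions, not the thesis's actual DFF or GI implementation, which is defined in Sections 3.2-3.3 of the thesis.

```python
import torch
import torch.nn as nn


class SpatialGraphConv(nn.Module):
    """Plain spatial graph convolution over a fixed skeleton adjacency.

    Input x: (N, C, T, V) -- batch, channels, frames, joints.
    A is a normalized V x V adjacency; neighbouring joint features are
    aggregated through A, then mixed across channels by a 1x1 convolution.
    """
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", A)                  # (V, V) normalized adjacency
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate neighbouring joints
        return self.proj(x)                           # per-node channel mixing


class GatedFusion(nn.Module):
    """Fuse two semantic branches with a learned per-channel gate
    instead of plain addition or concatenation (illustrative only)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # squeeze over frames and joints
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, a, b):
        w = self.gate(a + b)                          # fusion weights in (0, 1)
        return w * a + (1.0 - w) * b                  # weighted combination


if __name__ == "__main__":
    V = 25                                            # NTU RGB+D skeletons have 25 joints
    A = torch.eye(V)                                  # placeholder adjacency
    x = torch.randn(2, 3, 64, V)                      # (batch, xyz, frames, joints)
    gcn = SpatialGraphConv(3, 64, A)
    fuse = GatedFusion(64)
    y = gcn(x)
    print(fuse(y, y).shape)                           # torch.Size([2, 64, 64, 25])
```

The gate shown here is just one common way to make a fusion operation input-dependent; the abstract's point is that replacing a direct sum or concatenation with a learned, adaptive combination can remove a performance bottleneck.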

Contents:
Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
  2.1 Graph Convolution for Skeleton Action Recognition
  2.2 Multi-scale Feature Extraction
  2.3 Feature Fusion
3 Methodology
  3.1 Preliminaries
    3.1.1 Notations
    3.1.2 Data Preprocessing
    3.1.3 Graph Convolution
  3.2 Dynamic Feature Fusion
  3.3 Graph Involution
  3.4 Model Architecture
4 Experiments
  4.1 Datasets
    4.1.1 NTU RGB+D 60
    4.1.2 NTU RGB+D 120
  4.2 Implementation Details
  4.3 Ablation Studies
    4.3.1 Configurations of the Dynamic Feature Fusion Module
    4.3.2 Configurations of the Graph Involution Operator
  4.4 Comparisons with State of the Arts
  4.5 Visualization
5 Conclusions
References
Letter of Authority


Full text release date: 2024/08/08 (campus network, off-campus network, and National Central Library: Taiwan NDLTD)