Author: 黃冠汯 Kuan-Hung Huang
Thesis Title: 基於自學習特徵對齊之半監督式動作辨識 (STRA: Self-Training Representation Alignment for Semi-Supervised Action Recognition)
Advisor: 花凱龍 Kai-Lung Hua
Committee: 項天瑞 Tien-Ruey Hsiang, 鐘國亮 Kuo-Liang Chung, 郭景明 Jing-Ming Guo, 陳永耀 Yung-Yao Chen
Degree: 碩士 Master
Department: 電資學院 - 資訊工程系 (Department of Computer Science and Information Engineering)
Thesis Publication Year: 2021
Graduation Academic Year: 109
Language: English
Pages: 47
Keywords (in Chinese): 半監督式學習、動作辨識、自學習、圖卷積
Keywords (in other languages): Semi-supervised Learning, Action Recognition, Self-Training, Graph Convolutional Network
Abstract (in Chinese, translated):
In recent years, skeleton-based action recognition has made rapid progress, and most methods owe their excellent results to well-annotated datasets. However, building such large datasets in real-world settings is very costly. Moreover, incomplete data may be collected while building a dataset, for example, missing coordinates for some joints in a skeleton or missing skeletons in some frames. Because complete and incomplete skeleton data differ considerably, a trained model deployed in a practical application performs worse than it did during training. We therefore propose the following two methods: (1) a new self-training framework that reduces the use of labeled data by generating pseudo-labels; the framework further guarantees that the pseudo-labeled data converge sufficiently during training; and (2) a representation alignment module that adopts consistency regularization to minimize the impact of missing joints and missing skeletons on the model in practical applications. Our proposed method, STRA, not only improves model performance with a small amount of labeled data but also achieves comparable results when joints or skeletons are missing. We validate our method on the NTU and NUCLA datasets and compare it against several state-of-the-art methods.
Abstract (in English):
Most existing skeleton-based action recognition models leverage large labeled datasets to achieve strong results. However, procuring a large amount of labeled skeleton data in real-world scenarios is costly. Furthermore, missing joints and missing frames commonly occur during data collection, and they cause problems at test time due to the representational differences between complete and incomplete skeleton data. To address these problems, we propose two contributions: (1) a new self-training framework that reduces labeled-data usage by generating pseudo-labels; from a small amount of labeled data, the framework generates enough pseudo-labels to guarantee model convergence; and (2) a representation alignment module that adopts consistency regularization to minimize the effect of missing joints and frames. Our proposed method, STRA, not only improves the performance of GCN models with only a minimal amount of labeled data but also achieves similar performance under missing-joint and missing-frame conditions. We evaluate our method on the NTU and NUCLA datasets against state-of-the-art works.
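
To make the pseudo-labeling step concrete, the following is a minimal sketch in PyTorch, assuming a trained GCN classifier model, a loader over unlabeled skeleton clips, and a confidence threshold tau; these names and the fixed-threshold selection rule are illustrative assumptions, not the thesis's exact procedure.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def generate_pseudo_labels(model, unlabeled_loader, tau=0.95, device="cuda"):
        # Assign a pseudo-label to every unlabeled clip whose predicted
        # class confidence exceeds tau (tau is an assumed hyperparameter).
        model.eval()
        kept_x, kept_y = [], []
        for x in unlabeled_loader:            # x: (batch, channels, frames, joints)
            x = x.to(device)
            probs = F.softmax(model(x), dim=1)
            conf, pred = probs.max(dim=1)     # per-clip confidence and class
            keep = conf >= tau                # keep only confident predictions
            kept_x.append(x[keep].cpu())
            kept_y.append(pred[keep].cpu())
        return torch.cat(kept_x), torch.cat(kept_y)

The returned pairs would then be merged with the labeled set for the next self-training round; repeating this selection is what lets a small labeled set grow into a training signal large enough for the model to converge.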
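
Similarly, a hedged sketch of the representation alignment idea: the features of a complete skeleton sequence and of the same sequence with randomly dropped joints and frames are pulled together by a consistency loss. The encoder function, the drop rate p, and the mean-squared-error objective are assumptions chosen for illustration.

    import torch
    import torch.nn.functional as F

    def drop_joints_frames(x, p=0.1):
        # Simulate missing data by zeroing random joints and frames.
        # x: (batch, channels, frames, joints); p is an assumed drop rate.
        b, _, t, v = x.shape
        joint_mask = (torch.rand(b, 1, 1, v, device=x.device) > p).float()
        frame_mask = (torch.rand(b, 1, t, 1, device=x.device) > p).float()
        return x * joint_mask * frame_mask

    def alignment_loss(encoder, x):
        # Consistency regularization: align the representation of the
        # corrupted input with that of the complete input.
        z_full = encoder(x)                          # features of the complete clip
        z_miss = encoder(drop_joints_frames(x))      # features of the corrupted clip
        return F.mse_loss(z_miss, z_full.detach())   # stop-grad on the clean branch

Detaching the clean branch treats the complete-skeleton representation as the target, which is one common way of implementing consistency regularization: the corrupted view is pulled toward the clean one rather than the reverse.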