
Graduate Student: Shang-Fu Chen (陳尚富)
Thesis Title: Representation and Boundary Enhancement for Action Segmentation using Transformer (基於動作分割的特徵表示與邊界增強)
Advisor: Kai-Lung Hua (花凱龍)
Committee Members: Yu Tsao (曹昱), Jun-Cheng Chen (陳駿丞), Yi-Hui Chen (陳宜惠)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 41
Chinese Keywords: 動作分割 (action segmentation), 影片理解 (video understanding)
Foreign-Language Keywords: Action Segmentation, Video Understanding

In the task of action segmentation, a long untrimmed video must be partitioned into a sequence of action segments. Temporal models are widely used for this task, and recent Transformer-based methods have surpassed the overall performance of earlier temporal convolutional networks (TCNs). However, both TCNs and Transformers suffer from over-segmentation. Most previous work alleviates this problem with post-processing, but such post-processing is not applicable to every model and can sometimes degrade performance instead. In this thesis, we argue that directly improving the model's learning ability is more effective than post-processing. We therefore propose a set of loss functions targeting over-segmentation to strengthen representation learning, and adopt a multi-task learning scheme to improve the model's ability to learn action boundaries. Experiments show that, compared with previous Transformer-based methods, our approach achieves significant improvements on two widely used public datasets, 50Salads and Georgia Tech Egocentric Activities (GTEA), especially in alleviating over-segmentation.


In the task of action segmentation, the goal is to partition a long untrimmed video into a series of action segments. Recently, Transformer-based methods have surpassed the overall performance of the previous temporal convolutional networks (TCNs). However, both TCNs and Transformers face the challenge of over-segmentation, where the video is excessively split into small action units. Previous approaches often relied on post-processing techniques to mitigate this issue, but such methods are not universally applicable to every model and can sometimes lead to degraded performance. Therefore, in this thesis, we propose a set of loss functions to enhance representation learning and employ a multi-task learning approach to strengthen the model's ability to learn action boundaries. Through extensive experiments, we validate that our method exhibits significant improvements over previous Transformer-based approaches, particularly in resolving the challenge of over-segmentation. These improvements were observed on two commonly used public datasets, 50Salads and Georgia Tech Egocentric Activities (GTEA).
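The full text is under embargo, so only the abstract is available here. As a rough illustration of the kind of multi-task objective the abstract describes (per-frame classification, a loss that suppresses over-segmentation, and an auxiliary boundary-learning task), the following is a minimal PyTorch sketch. It assumes a truncated temporal-smoothing term in the spirit of the MS-TCN smoothing loss and a hypothetical one-layer boundary head; the names, loss forms, and weights are assumptions for illustration, not the thesis's actual losses or architecture.

# Minimal sketch only: not the thesis's method, just the general shape of a
# multi-task objective for frame-wise action segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def truncated_smoothing_loss(logits: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Penalize abrupt frame-to-frame changes in class log-probabilities.

    logits: (batch, num_classes, num_frames)
    """
    log_probs = F.log_softmax(logits, dim=1)
    # Squared difference between consecutive frames; the clamp keeps genuine
    # action transitions from dominating the loss.
    diff = log_probs[:, :, 1:] - log_probs[:, :, :-1].detach()
    return torch.clamp(diff ** 2, max=tau ** 2).mean()


class BoundaryHead(nn.Module):
    """Hypothetical auxiliary head predicting a per-frame boundary score."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, 1, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim, num_frames) -> (batch, num_frames)
        return self.proj(features).squeeze(1)


def multitask_loss(logits, features, labels, boundary_targets, boundary_head,
                   lambda_smooth: float = 0.15, lambda_boundary: float = 1.0):
    """Combine classification, smoothing, and boundary losses (illustrative weights)."""
    cls_loss = F.cross_entropy(logits, labels)            # per-frame action labels
    smooth_loss = truncated_smoothing_loss(logits)        # over-segmentation penalty
    boundary_loss = F.binary_cross_entropy_with_logits(   # auxiliary boundary task
        boundary_head(features), boundary_targets)
    return cls_loss + lambda_smooth * smooth_loss + lambda_boundary * boundary_loss


if __name__ == "__main__":
    # Toy shapes: 1 video, 19 action classes, 64-dim frame features, 200 frames.
    logits = torch.randn(1, 19, 200)
    features = torch.randn(1, 64, 200)
    labels = torch.randint(0, 19, (1, 200))
    boundaries = torch.zeros(1, 200)
    boundaries[:, 50] = 1.0  # one annotated boundary frame
    head = BoundaryHead(feat_dim=64)
    print(multitask_loss(logits, features, labels, boundaries, head).item())

In setups like this, the boundary targets are usually derived from the frame-level labels by marking frames where the action class changes, often with a small temporal tolerance; that is one common way to define such an auxiliary task, though the thesis may define its boundary supervision differently.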

Chinese Abstract ........ I
Abstract ........ II
Acknowledgement ........ III
Contents ........ IV
List of Figures ........ V
List of Tables ........ VII
1 Introduction ........ 1
2 Related Work ........ 4
3 Proposed Method ........ 6
  3.1 Transition Smoothing ........ 9
  3.2 Over-segmentation Reduction ........ 10
  3.3 Action Boundary Enhancement ........ 13
4 Experiments ........ 15
  4.1 Evaluation Datasets ........ 15
  4.2 Evaluation Metrics ........ 16
  4.3 Implementation Details ........ 16
  4.4 Qualitative Evaluation ........ 17
  4.5 Ablation Study ........ 23
5 Conclusions ........ 25
References ........ 26
Authorization Letter ........ 32


Full text release date: 2025/08/01 (campus network)
Full text release date: 2025/08/01 (off-campus network)
Full text release date: 2025/08/01 (National Central Library: Taiwan Master's and Doctoral Theses System)