
Student: HE, JIABIN (何嘉斌)
Thesis Title: Multi-modal, Multi-labeled Sport Highlight Extraction (多模態及多標籤的運動精彩片段擷取)
Advisor: Hsing-Kuo Pao (鮑興國)
Committee Members: Yuh-Jye Lee (李育杰), Tien-Ruey Hsiang (項天瑞)
Degree: Master (碩士)
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science (電資學院 - 資訊工程系)
Publication Year: 2019
Graduation Academic Year: 107 (ROC calendar)
Language: English
Pages: 52
Keywords: multi-modal learning, multi-label learning, fusion strategy, feature representation
Access count: 240 views, 0 downloads

The development of technology has made the generation and dissemination of multimedia more convenient and faster, and the amount of video on the Internet grows with each passing day. How to search effectively for the videos we need among huge video resources, and how to quickly find the content we need within a lengthy video, are important research directions in computer vision and video understanding. Basketball is a widely loved sport, so videos of basketball games are too numerous to enumerate. If we can recognize the highlights in a basketball game video, viewers can save a great deal of viewing time while still enjoying the same pleasure.
A basketball game video contains information in several modalities, such as images, audio, the score, and the game clock, and different algorithms have been developed to analyze data in each modality. However, we hope to fuse multi-modal features and use the resulting more comprehensive and richer information to better recognize basketball highlights. To this end, we explore two fusion strategies: latent feature fusion and early feature fusion.
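This record does not include the thesis body, so as a rough illustration only, here is a minimal PyTorch-style sketch of the difference between the two strategies named above; all module names, feature dimensions, and the two-modality (visual plus audio) setup are hypothetical, not taken from the thesis.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Latent feature fusion: encode each modality separately, then
    concatenate the latent vectors before the classifier."""
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256, n_classes=2):
        super().__init__()
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, visual, audio):
        z = torch.cat([self.visual_enc(visual), self.audio_enc(audio)], dim=-1)
        return self.classifier(z)

class EarlyFusion(nn.Module):
    """Early feature fusion: concatenate the raw modality features first,
    then learn a single joint representation."""
    def __init__(self, visual_dim=512, audio_dim=128, hidden=256, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, visual, audio):
        return self.net(torch.cat([visual, audio], dim=-1))
```

In early fusion the modalities share one joint network from the start; in latent fusion each modality first gets its own encoder, which tolerates very different scales and feature types across modalities.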
In addition, we train a multi-label model on the factors that determine how exciting a segment is, extract the joint features of these factors, and add them to the multi-modal model to further improve its performance. We call this method multi-modal, multi-label based classification.
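Likewise, a minimal sketch of the multi-label idea described above, assuming binary "highlight factor" labels trained with a BCE loss; the example factors, dimensions, and class names are hypothetical.

```python
import torch
import torch.nn as nn

class FactorNet(nn.Module):
    """Multi-label model over hypothetical highlight factors (e.g. dunk,
    three-pointer, crowd cheering): one logit per factor, trained jointly
    with nn.BCEWithLogitsLoss. The shared hidden layer serves as the
    joint feature of those factors."""
    def __init__(self, in_dim=640, hidden=128, n_factors=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.Linear(hidden, n_factors)  # one logit per factor label

    def forward(self, x):
        joint = self.backbone(x)   # joint feature of all factor labels
        return self.heads(joint), joint

# Per the abstract, the joint features would then be concatenated with the
# fused multi-modal features before the final highlight classifier.
```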

1 Introduction . . . 1
2 Related Work . . . 4
3 Methodology . . . 6
  3.1 Visual-based Classification . . . 6
    3.1.1 Multi-branch Convolutional Networks . . . 7
    3.1.2 Multi-channel Convolutional Networks . . . 8
    3.1.3 3D Convolutional Networks . . . 9
    3.1.4 Long-term Recurrent Convolutional Networks . . . 11
  3.2 Audio-based Classification . . . 12
  3.3 Multi-modal based Classification . . . 14
    3.3.1 Latent Features Fusion . . . 15
    3.3.2 Early Features Fusion . . . 15
  3.4 Multi-modal Multi-label based Classification . . . 16
4 Experiments and Results . . . 23
  4.1 Dataset . . . 23
    4.1.1 Data Collection . . . 24
    4.1.2 Data Preprocessing . . . 24
  4.2 Unimodal Model . . . 27
  4.3 Multi-modal Model . . . 29
  4.4 Multi-modal Multi-label Model . . . 31
  4.5 Highlight Extraction and Evaluation . . . 35
5 Conclusions . . . 38
References . . . 40


Full-text release date: 2024/07/01 (campus network)
Full-text release date: not authorized for public access (off-campus network)
Full-text release date: not authorized for public access (National Central Library: Taiwan NDLTD system)