
Author: 陳柏勳 (Po-Hsun Chen)
Thesis Title: 藉由使用音訊和聊天訊息之多階段架構偵測直播影片精彩片段 (Live-Stream Highlight Detection Through Multi-Stage Architecture Using Audio and Chat Messages)
Advisor: 戴碧如 (Bi-Ru Dai)
Committee Members: 陳怡伶 (Yi-Ling Chen), 戴志華 (Chih-Hua Tai)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar, 2022-2023)
Language: English
Number of Pages: 59
Chinese Keywords: 多階段架構, 影片精彩片段偵測, 局部注意力, 直播
English Keywords: Multi-Stage Architecture, Video Highlight Detection, Local Attention, Live-Stream
Abstract (Chinese):
    With the rapid expansion of live-streaming services, a large number of unedited videos are being produced. To quickly locate the clips that interest viewers, methods that can automatically detect highlights have become very important. Many recent studies have addressed highlight detection, but because live-stream videos are usually several hours long and the type of content does not stay the same, video-based methods cannot detect highlights effectively under hardware constraints. We therefore adopt a multi-stage architecture that detects highlights from shallow to deep: the first stage performs a preliminary filtering of the video, and the second stage uses multi-modal information to detect highlights precisely. The experimental results show that combining audio information and chat information enables the model to detect highlights more accurately, and that the multi-stage architecture detects highlights more efficiently and precisely than a single model.


Abstract (English):
    With the rapid expansion of live-streaming services, a huge number of unedited videos are being produced. To help audiences quickly locate clips of interest, the ability to detect highlights automatically has become very important. However, a live-stream video usually lasts several hours and its content type does not stay the same, so video-based methods cannot detect highlights effectively due to hardware limitations. We therefore propose a multi-stage architecture that detects highlights from shallow to deep, with an initial filtering of the video in the first stage and precise highlight detection using multi-modal information in the second stage. The experimental results show that combining audio and chat information enables the model to detect highlights more accurately, and that the multi-stage architecture detects highlights more efficiently and precisely than a single model.
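
    The abstract describes the pipeline only at a high level. Below is a minimal, purely illustrative Python sketch of the two-stage idea, a cheap filtration stage followed by a more precise detection stage over audio and chat signals; the segment fields, thresholds, and scoring rule are assumptions made for illustration, not the thesis's actual features or models.

    # A minimal sketch of a two-stage highlight detection pipeline
    # (illustrative assumptions only; not the thesis's actual feature
    # set, thresholds, or model).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Segment:
        start: float          # segment start time in seconds
        end: float            # segment end time in seconds
        chat_rate: float      # normalized chat messages per second (cheap signal)
        audio_energy: float   # normalized mean audio energy (cheap signal)

    def filtration_stage(segments: List[Segment], chat_threshold: float = 0.5) -> List[Segment]:
        """Stage 1: cheap filtering keeps only candidate highlight segments."""
        return [s for s in segments if s.chat_rate >= chat_threshold]

    def detection_stage(candidates: List[Segment], score_threshold: float = 0.7) -> List[Segment]:
        """Stage 2: a placeholder multi-modal scorer over audio and chat signals."""
        def score(s: Segment) -> float:
            # A real detector would combine learned audio and chat representations;
            # here we simply average the two hand-crafted signals.
            return 0.5 * s.chat_rate + 0.5 * s.audio_energy
        return [s for s in candidates if score(s) > score_threshold]

    if __name__ == "__main__":
        stream = [
            Segment(0, 30, chat_rate=0.2, audio_energy=0.3),
            Segment(30, 60, chat_rate=0.9, audio_energy=0.8),   # likely highlight
            Segment(60, 90, chat_rate=0.6, audio_energy=0.4),
        ]
        highlights = detection_stage(filtration_stage(stream))
        print(highlights)   # only the 30-60 s segment survives both stages

    The staging mirrors the efficiency argument in the abstract: the more precise (and more expensive) multi-modal scorer only needs to run on the small candidate set that survives the cheap first-stage filter, which is what keeps hour-long streams tractable.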

    Table of Contents:
    Recommendation Letter
    Approval Letter
    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related Work
    3 Definitions
    4 Proposed Model
      4.1 Feature Extraction
        4.1.1 Chat Information
        4.1.2 Audio Information
      4.2 Multi-Stage Architecture
        4.2.1 Filtration Stage
        4.2.2 Detection Stage
    5 Experiments
      5.1 Dataset
      5.2 Experimental Setups
      5.3 Evaluation Metrics
      5.4 Comparison Methods
      5.5 Experimental Results
        5.5.1 Parameters Selection
        5.5.2 Performance Results
      5.6 Ablation Study
        5.6.1 Effectiveness of Different Audio Features
        5.6.2 Different Candidate Highlight Video Handling Methods
        5.6.3 Effectiveness of Different Audio Features
    6 Conclusions
    References
    Letter of Authority


    Full-text release date: 2033/02/13 (campus network)
    Full text not authorized for public release (off-campus network)
    Full-text release date: 2113/02/13 (National Central Library: Taiwan NDLTD system)