
Author: Chieh-Ming Liaw (廖傑明)
Thesis Title: Live-Streaming Video Highlight Detection Using Chat Messages (藉由聊天訊息偵測串流直播影片之精彩片段)
Advisor: Bi-Ru Dai (戴碧如)
Oral Defense Committee: Chih-Hua Tai (戴志華), Hong-Han Shuai (帥宏翰), Bi-Ru Dai (戴碧如), Yi-Ling Chen (陳怡伶)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2020
Graduation Academic Year: 108 (2019-2020)
Language: English
Number of Pages: 41
Keywords: live stream, video highlight detection, attention model

As technology advances and data transmission becomes ever faster, many new services have emerged. Live streaming, one of the most popular services in recent years, has become part of many people's daily lives. Unlike conventional TV programs, live streams allow real-time interaction with the audience: viewers can join the discussion in each channel's chat room as the stream unfolds. At the same time, live-streaming content tends to be long and loosely structured; whereas ordinary programs are short and tightly edited, a live-stream video often runs for several hours. Even for such a popular service, these unedited videos are difficult to make appealing to people outside the community. Moreover, the enormous amount of data produced by videos lasting several hours imposes severe hardware limitations on existing methods that automatically detect and clip highlights from visual content. In this thesis, we propose a Long-Short Term Attention architecture (LSTA) based on chat messages. Experimental results show that our chat-message-based design is a more reliable way to detect highlights.


In recent years, live-streaming services have been booming and continue to grow on the Internet.
Unlike TV shows and movies, live streams can be much longer and more variable in length, with no specific restrictions on content.
Traditional video highlight detection methods, which rely on visual features, suffer from difficulties with data scale and inconsistency.
To address these issues, we instead extract information from the audience discussion in the chat room for highlight detection.
In this thesis, an attention-based model, LSTA, is proposed to integrate the long-term and short-term information in a chat room and determine which fragments should be identified as highlights.
Our results demonstrate improvements over state-of-the-art visual and textual content-based approaches.
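To make the chat-based idea more concrete, below is a minimal Python sketch of how chat messages could be turned into per-segment frequency and diversity signals of the kind listed in the table of contents (Sections 3.4.1 and 3.4.2). The function name chat_features, the 60-second segment length, the whitespace tokenization, and the use of Shannon entropy as the diversity measure are illustrative assumptions, not the thesis's actual implementation.

from collections import Counter
from math import log2

def chat_features(messages, segment_seconds=60):
    """messages: list of (timestamp_in_seconds, text) pairs from one live-stream chat log.
    Returns one (frequency, diversity) pair per fixed-length segment."""
    if not messages:
        return []
    last = max(t for t, _ in messages)
    buckets = [[] for _ in range(int(last // segment_seconds) + 1)]
    for t, text in messages:
        buckets[int(t // segment_seconds)].append(text)

    features = []
    for texts in buckets:
        frequency = len(texts)                          # message count in this segment
        tokens = [w for msg in texts for w in msg.lower().split()]
        counts = Counter(tokens)
        total = sum(counts.values())
        # Shannon entropy of the token distribution, used here as a diversity proxy
        diversity = -sum(c / total * log2(c / total) for c in counts.values()) if total else 0.0
        features.append((frequency, diversity))
    return features

# Example: repeated "GG" spam gives high frequency but zero diversity,
# while a later, more varied message gives lower frequency but higher diversity.
print(chat_features([(0.5, "GG"), (1.0, "gg"), (2.0, "GG"), (70.0, "nice play wow")]))

In this sketch, a burst of identical spam produces a high-frequency, low-diversity segment, whereas varied discussion raises both signals; a downstream model can then weigh such per-segment features when scoring candidate highlights.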

Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
3 Highlight Detection
  3.1 Problem Definition
  3.2 Highlight Labeling
  3.3 Data Preprocessing
  3.4 Feature Extraction
    3.4.1 Frequency
    3.4.2 Diversity
    3.4.3 Semantics
  3.5 Model
    3.5.1 Basic Model
    3.5.2 Long-Short Term Attention (LSTA) Model
4 Experiments and Discussions
  4.1 Datasets
    4.1.1 Text Dataset
    4.1.2 Video-Text Dataset
    4.1.3 ESports Dataset
  4.2 Evaluation Metric
  4.3 Training Details
    4.3.1 Experimental Setup
    4.3.2 Deal with Imbalanced Data
    4.3.3 Deal with Variable-Length Data
  4.4 Experiment Results and Discussions
    4.4.1 The Comparison of Different Designs of Context Window
    4.4.2 Compare with Textual Content-Based Methods
    4.4.3 Compare with Visual Content-Based Methods
    4.4.4 Discussions on the Visualized Results
    4.4.5 Discussions on the Relation of Video Length and Performance
  4.5 Ablation Study
    4.5.1 The Importance of Each Feature
    4.5.2 Improve the Model Using Only Frequency and Diversity
  4.6 Apply a Different Process to Generate the Global View of a Segment
5 Conclusions and Future Works
  5.1 Conclusions
  5.2 Future Works
References

Full-text release date: 2025/08/23 (campus network only). The full text is not authorized for public access off campus or via the National Central Library (Taiwan NDLTD system).