
Author: 張宗雅 (Zhung-Ya Chang)
Thesis Title: 基於影劇故事分析之影片摘要 (Movie Summary Based on Story Analysis)
Advisor: 楊傳凱 (Chuan-Kai Yang)
Committee: 林伯慎 (Bor-Shen Lin), 賴源正 (Yuan-Cheng Lai)
Degree: 碩士 (Master)
Department: 管理學院 - 資訊管理系 (School of Management - Department of Information Management)
Thesis Publication Year: 2023
Graduation Academic Year: 111
Language: 中文 (Chinese)
Pages: 67
Keywords (in Chinese): 影片摘要、自然語言、人臉辨識、語者分割聚類
Keywords (in other languages): Video summarization, Natural language, Face recognition, Speaker diarization
    在觀賞長篇連續劇或是一部電影續作時,可能會遇到忘記先前劇情的狀況,而且一部電影通常耗時90分鐘,歐美的系列連續劇更是多達數十集。藉由影片摘要將影片重要片段篩選,可幫助使用者迅速回顧影片內容。

    為上述目的,本論文提出一個影片摘要系統,在本系統中可輸入一部影片,系統中則有三種不同模型會分別處理電影文本、畫面辨識和聲音分析,當中結合了深度學習和自然語言處理等方法,來實現針對故事語意的影片摘要。

    畫面模型部分,我們使用人臉辨識模型和語者分割聚類來辨別當前幀說話人是誰,再將對應的角色名字和字幕組合,作為輔助電影摘要片段的根據。文本模型部分,我們先把預處理好的字幕對話用抽象對話摘要模型獲得推論的摘要,再從IMDb資料庫獲得該影片的必要資訊(大綱、主要演員等),將影片大綱(Synopsis)和字幕台詞(Subtitles)結合Transformer模型找其語意關聯性,以找到最相關的台詞段落,再利用字幕的時間資訊找到對應畫面,最後產生摘要影片結果。


    When watching a long-running drama series or a movie sequel, viewers may forget the earlier plot: a movie typically runs about 90 minutes, and Western series often span dozens of episodes. By filtering a video down to its important segments, video summarization helps users quickly review its content.

    For this purpose, this thesis proposes a movie summarization system. Given an input movie, three different models separately process the movie's text, visual frames, and audio; deep learning and natural language processing methods are combined to produce a summary oriented toward the semantics of the story.
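
    Read as a pipeline, the three per-modality models feed a single summarization step. The following is a minimal orchestration sketch; every function and field name here is a hypothetical placeholder, not an interface from the thesis, and the per-modality "models" are stubs standing in for the real ones.

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    # Aggregated outputs of the three per-modality models.
    text_summary: str = ""                               # abstractive dialogue summary
    speaker_by_time: dict = field(default_factory=dict)  # timestamp (sec) -> character name
    clip_spans: list = field(default_factory=list)       # (start_sec, end_sec) summary clips

def analyze_text(subtitles):
    # Placeholder for the dialogue-summarization model: here it just
    # concatenates the dialogue instead of abstracting it.
    return " ".join(line for _, _, line in subtitles)

def analyze_frames(frames):
    # Placeholder for face recognition + speaker diarization: maps each
    # annotated timestamp to the character identified as speaking.
    return dict(frames)

def summarize(subtitles, frames):
    # subtitles: list of (start_sec, end_sec, line);
    # frames: list of (timestamp_sec, character_name).
    # Keeps only the subtitle spans during which a speaker was identified.
    result = Analysis()
    result.text_summary = analyze_text(subtitles)
    result.speaker_by_time = analyze_frames(frames)
    result.clip_spans = [(s, e) for s, e, _ in subtitles
                         if any(s <= t <= e for t in result.speaker_by_time)]
    return result

summary = summarize([(0.0, 2.0, "Hello.")], [(1.0, "Alice")])
print(summary.clip_spans)  # -> [(0.0, 2.0)]
```

    In the actual system each stub would be replaced by the corresponding model (dialogue summarizer, face recognizer, speaker diarizer); only the shape of the data flow is illustrated here.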

    In the visual model, we use a face recognition model and speaker diarization to identify who is speaking in the current frame, then pair the recognized character's name with the subtitles as supporting evidence for selecting summary clips. In the text model, we first feed the preprocessed subtitle dialogue to an abstractive dialogue summarization model to obtain an inferred summary, and retrieve the film's essential information (synopsis, main cast, etc.) from the IMDb database. A Transformer model then measures the semantic relevance between the film synopsis and the subtitle lines to locate the most relevant dialogue passages; the subtitles' timing information maps those passages to the corresponding frames, and the system finally produces the summary video.
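
    As an illustration of the text-model matching step, the sketch below scores each subtitle segment against the synopsis sentences and returns the time spans of the best matches. The thesis uses a Transformer (MiniLM-style) sentence encoder for semantic similarity; to keep this example self-contained, a plain term-frequency cosine similarity stands in for the encoder, and the function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; a stand-in for the subword tokenizer
    # a Transformer encoder would use.
    return re.findall(r"[a-z']+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over raw term counts.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_clips(synopsis, subtitles, top_k=2):
    # synopsis: list of sentences; subtitles: list of (start_sec, end_sec, line).
    # Scores each subtitle line against its most similar synopsis sentence,
    # keeps the top_k segments, and returns their time spans in play order.
    syn_vecs = [Counter(tokenize(s)) for s in synopsis]
    scored = []
    for start, end, line in subtitles:
        vec = Counter(tokenize(line))
        score = max(cosine(vec, sv) for sv in syn_vecs)
        scored.append((score, start, end))
    top = sorted(scored, reverse=True)[:top_k]
    return sorted((start, end) for _, start, end in top)

synopsis = ["The detective finally confronts the thief on the rooftop."]
subs = [
    (12.0, 15.5, "Nice weather today, isn't it?"),
    (80.0, 84.0, "You can't run anymore, thief!"),
    (85.0, 90.0, "The detective steps onto the rooftop."),
]
print(best_clips(synopsis, subs, top_k=2))  # -> [(80.0, 84.0), (85.0, 90.0)]
```

    A real implementation would replace `tokenize`/`cosine` with sentence embeddings and could merge adjacent selected spans before cutting the summary clips from the video.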

    Table of Contents: Chinese Abstract; English Abstract; Acknowledgments; Contents; List of Figures; List of Tables.
    Chapter 1: Introduction (1.1 Research Motivation and Purpose; 1.2 Thesis Organization)
    Chapter 2: Literature Review (2.1 Video Summarization; 2.2 Movie Analysis; 2.3 Face Recognition; 2.4 Natural Language Processing; 2.5 Speaker Diarization)
    Chapter 3: Algorithm Design and System Implementation (3.1 System Workflow; 3.2 Video Preprocessing: 3.2.1 Blank Frames, 3.2.2 Rule-of-Thirds Composition, 3.2.3 Shot Segmentation, 3.2.4 Face Detection; 3.3 Face Recognition; 3.4 Speaker Diarization; 3.5 Text Preprocessing: 3.5.1 Subtitle Cleaning, 3.5.2 IMDb Data Retrieval; 3.6 Dialogue Summarization; 3.7 Semantic Similarity: 3.7.1 Difflib, 3.7.2 Transformer Model, 3.7.3 Semantic-Similarity Summarization Rules)
    Chapter 4: Results and Evaluation (4.1 System Environment; 4.2 Datasets; 4.3 Dialogue Summary Evaluation Methods: 4.3.1 ROUGE, 4.3.2 BERTScore; 4.4 Dialogue Summary Experimental Results; 4.5 Semantic Similarity Experimental Results)
    Chapter 5: Conclusions and Future Work
    References

    [1] IMDb. http://www.imdb.com/. Accessed: 2023-07-25.
    [2] Huang, Qingqiu and Xiong, Yu and Rao, Anyi and Wang, Jiaze and Lin, Dahua. MovieNet: A Holistic Dataset for Movie Understanding. In The European Conference on Computer Vision. ECCV, 2020.
    [3] Jacques Aumont and Michel Marie. L'Analyse des films (Analysis of Film). Nathan, 1998.
    [4] Thomas Sobchack and Vivian Sobchack. An Introduction to Film. Longman, 1997. ISBN 067339302X.
    [5] Rao, Anyi and Wang, Jiaze and Xu, Linning and Jiang, Xuekun and Huang, Qingqiu and Zhou, Bolei and Lin, Dahua. A Unified Framework for Shot Type Classification Based on Subject Centric Lens. In The European Conference on Computer Vision. ECCV, 2020.
    [6] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
    [8] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
    [9] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir Bourdev. Beyond frontal faces: Improving person recognition using multiple cues. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4804–4813, 2015.
    [10] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047– 6056, 2018.
    [11] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
    [12] M. Otani, Y. Nakashima, T. Sato and N. Yokoya. Textual description-based video summarization for video blogs. In IEEE International Conference on Multimedia and Expo (ICME), 2015.
    [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
    [14] Mayu Otani and Yuta Nakashima and Esa Rahtu and Janne Heikkil and Naokazu Yokoya. Video Summarization Using Deep Semantic Features. In Asian Conference on Computer Vision, 2016.
    [15] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
    [16] Y. Taigman, M. Yang, M. Ranzato and L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1701-1708, doi: 10.1109/CVPR.2014.220.
    [17] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management (CIKM '13). Association for Computing Machinery, New York, NY, USA, 2333–2338.
    [18] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). Association for Computing Machinery, New York, NY, USA, 101–110.
    [19] Omar Khattab and Matei Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 39–48, 2020.
    [20] A. Zhang, Q. Wang, Z. Zhu, J. Paisley and C. Wang. Fully Supervised Speaker Diarization. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 6301-6305, doi: 10.1109/ICASSP.2019.8683892.
    [21] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 5776–5788.
    [22] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level Aggregation for Speaker Recognition in the Wild. In ICASSP 2019, pages 5791–5795, doi: 10.1109/ICASSP.2019.8683120.
    [23] Rule of Thirds in Filmmaking. https://taketones.com/blog/rule-of-thirds-in-filmmaking. Accessed on 2023.
    [24] T. Baltrušaitis, P. Robinson and L. -P. Morency. OpenFace: An open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016, pp. 1-10, doi: 10.1109/WACV.2016.7477553.
    [25] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886–893 vol. 1, doi: 10.1109/CVPR.2005.177.
    [26] Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, Xiaojie Wang. Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1229-1243.
    [27] Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. In ECCV, 2020.
    [28] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.
    [29] Bogdan Gliwa, Iwona Mochol, Maciej Biesek, Aleksander Wawer. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70-79.
    [30] Video Structure. https://www.rankred.com/ibm-ai-that-detects-scene-in-video/. Accessed: 2023-08-01.

    Full text public date 2026/08/08 (Intranet public)
    Full text public date 2028/08/08 (Internet public)
    Full text public date 2028/08/08 (National library)