| Graduate Student | 盧立恩 Li-En Lu |
|---|---|
| Thesis Title | 基於人臉與語者分群之影片視覺化系統 (A Video Visualization System Based on Face and Speaker Clustering) |
| Advisor | 楊傳凱 Chuan-Kai Yang |
| Oral Defense Committee | 楊傳凱 Chuan-Kai Yang, 羅乃維 Nai-Wei Lo, 林伯慎 Bor-Shen Lin |
| Degree | Master |
| Department | Department of Information Management, College of Management |
| Year of Publication | 2020 |
| Academic Year of Graduation | 108 |
| Language | Chinese |
| Pages | 49 |
| Keywords | Face Tracking, Scene Change Detection, Face Clustering, Speaker Diarization, Speaker Clustering |
When first watching a video, viewers may have trouble telling characters apart, perhaps because they are unfamiliar with the characters' facial features or less sensitive to their vocal characteristics. This thesis therefore proposes a video visualization system. Given an input video, the system first applies PySceneDetect for scene change detection and MTCNN for face tracking, then extracts facial and voice features separately, performing face clustering with the DBSCAN algorithm and speaker clustering with the UISRNN algorithm. Finally, it uses video segments containing only a single character to associate the face clusters with the speaker clusters. Experiments show that the combined face-speaker system with bipartite matching achieves higher clustering accuracy, and the visualization system lets viewers consult auxiliary information while watching to learn which actor appears in a given segment.
When watching a video, we may find it hard to tell characters apart because their faces or voices are unfamiliar. We therefore propose a video visualization system. Given an input video, our system uses PySceneDetect for scene change detection and MTCNN for face tracking. We then extract facial and voice features, applying the DBSCAN algorithm for face clustering and the UISRNN algorithm for speaker clustering. Finally, we establish the correspondence between the face and speaker clustering results using video fragments that contain a single character. In our experiments, we observed that applying bipartite matching to the single-character face and speaker results yields higher clustering accuracy. Furthermore, users can learn which actor appears in a segment through the auxiliary information our system provides.
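The final association step above can be sketched in a few lines. The sketch below is illustrative, not the thesis implementation: it substitutes DBSCAN for both clustering stages (the thesis uses UISRNN for speakers), and all embeddings and segment labels are synthetic. Co-occurrence counts from single-character segments vote for each face-speaker pairing, and a maximum-weight bipartite matching picks the final correspondence.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.optimize import linear_sum_assignment


def cluster_embeddings(embeddings, eps=0.5, min_samples=2):
    """Cluster feature embeddings with DBSCAN; label -1 marks noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)


def match_face_speaker(face_labels, speaker_labels):
    """Associate face clusters with speaker clusters via bipartite matching.

    face_labels[i] and speaker_labels[i] come from the same video segment,
    each containing exactly one character, so every segment casts one vote
    for a (face cluster, speaker cluster) pair.
    """
    faces = sorted(set(face_labels) - {-1})
    speakers = sorted(set(speaker_labels) - {-1})
    votes = np.zeros((len(faces), len(speakers)))
    for f, s in zip(face_labels, speaker_labels):
        if f != -1 and s != -1:  # skip segments DBSCAN marked as noise
            votes[faces.index(f), speakers.index(s)] += 1
    # Maximize total votes by minimizing their negation.
    rows, cols = linear_sum_assignment(-votes)
    return {faces[r]: speakers[c] for r, c in zip(rows, cols)}


# Synthetic demo: two well-separated groups of 2-D "embeddings" ...
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(cluster_embeddings(emb))  # [0 0 1 1]

# ... and six single-character segments covering two characters.
face_labels = [0, 0, 1, 1, 0, 1]
speaker_labels = [1, 1, 0, 0, 1, 0]
print(match_face_speaker(face_labels, speaker_labels))  # {0: 1, 1: 0}
```

The Hungarian assignment (`linear_sum_assignment`) is one standard way to realize the bipartite matching named in the abstract; it guarantees each face cluster is paired with at most one speaker cluster, which mirrors the thesis's one-character-per-segment assumption.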