
Graduate Student: Li-En Lu (盧立恩)
Thesis Title: A Video Visualization System Based on Face and Speaker Clustering (基於人臉與語者分群之影片視覺化系統)
Advisor: Chuan-Kai Yang (楊傳凱)
Examination Committee: Chuan-Kai Yang (楊傳凱); Nai-Wei Lo (羅乃維); Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Information Management, School of Management
Publication Year: 2020
Graduation Academic Year: 108 (2019-2020)
Language: Chinese
Number of Pages: 49
Keywords: Face Tracking; Scene Change Detection; Face Clustering; Speaker Diarization; Speaker Clustering

When first watching a video, viewers can have trouble telling who is who, perhaps because they are unfamiliar with the characters' facial features or less sensitive to their vocal characteristics. This thesis therefore proposes a video visualization system. Given an input video, we first apply PySceneDetect for scene change detection and MTCNN for face tracking, then extract facial and voice features and cluster them, using the DBSCAN algorithm for faces and the UIS-RNN algorithm for speakers. Finally, we use video segments that contain a single character to associate the face clustering results with the speaker clustering results. Our experiments show that the face-speaker system obtained after bipartite matching achieves higher clustering accuracy; in addition, the video visualization system provides auxiliary information that tells viewers which actor plays the character in each segment.
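As a concrete illustration of the extraction stage just described, the following minimal Python sketch runs scene change detection with PySceneDetect and face detection with the ipazc/mtcnn package. The video file name, the detector threshold, and the one-frame-per-scene sampling are assumptions for illustration (the thesis performs full face tracking within each scene), and the `detect` convenience function shown here comes from current PySceneDetect releases, which postdate the thesis.

```python
# Minimal sketch: scene boundaries via PySceneDetect, then MTCNN face
# detection on a sample frame of each scene (illustrative parameters).
import cv2
from mtcnn import MTCNN
from scenedetect import detect, ContentDetector

# Scene change detection: returns (start, end) timecodes per scene.
scenes = detect("video.mp4", ContentDetector(threshold=27.0))  # hypothetical file

detector = MTCNN()
cap = cv2.VideoCapture("video.mp4")

for start, end in scenes:
    # Sample the first frame of the scene; a real tracker would follow
    # each detected face frame by frame until the scene boundary.
    cap.set(cv2.CAP_PROP_POS_FRAMES, start.get_frames())
    ok, frame = cap.read()
    if not ok:
        continue
    # MTCNN expects RGB input; OpenCV decodes frames as BGR.
    faces = detector.detect_faces(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Crop each detected face for later feature extraction.
    crops = [frame[y:y + h, x:x + w] for x, y, w, h in (f["box"] for f in faces)]
```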


When watching a video, we may find it hard to tell the characters apart because their faces or voices are unfamiliar. We therefore propose a video visualization system. Given an input video, the system uses PySceneDetect for scene change detection and MTCNN for face tracking, then extracts facial and voice features and applies the DBSCAN algorithm for face clustering and the UIS-RNN algorithm for speaker clustering. Finally, it finds the correspondence between the face and speaker clustering results using video segments that contain a single character. In our experiments, we observed that bipartite matching between the individual face and speaker systems yields higher clustering accuracy. Furthermore, the system's auxiliary information lets viewers see which actor appears in each part of the video.
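The clustering and association steps could look roughly like the sketch below. It assumes face embeddings and per-detection speaker-cluster ids are already available (the hypothetical `.npy` files stand in for the feature extraction and UIS-RNN stages), uses scikit-learn's DBSCAN with illustrative `eps`/`min_samples` values, and solves the face-to-speaker assignment with SciPy's Hungarian-method solver over a co-occurrence matrix counted on single-character segments, as the abstract describes.

```python
# Sketch of face clustering plus face-speaker bipartite matching
# (assumed inputs and parameters; not the thesis's exact code).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import DBSCAN

# Hypothetical precomputed inputs: one row per face detection taken from
# a single-character segment, plus the speaker-cluster id of that segment.
face_embeddings = np.load("face_embeddings.npy")   # shape (n, d)
speaker_labels = np.load("speaker_labels.npy")     # shape (n,)

# Face clustering; DBSCAN labels noise points as -1.
face_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(face_embeddings)

# Build a co-occurrence matrix between face clusters and speaker clusters.
faces = np.unique(face_labels[face_labels >= 0])
speakers = np.unique(speaker_labels)
cooc = np.zeros((len(faces), len(speakers)))
for f, s in zip(face_labels, speaker_labels):
    if f >= 0:
        cooc[np.searchsorted(faces, f), np.searchsorted(speakers, s)] += 1

# Bipartite matching: pair each face cluster with at most one speaker
# cluster so that total co-occurrence is maximized (Hungarian method).
rows, cols = linear_sum_assignment(cooc, maximize=True)
matching = {int(faces[r]): int(speakers[c]) for r, c in zip(rows, cols)}
```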

Table of Contents: Chinese Abstract; English Abstract; Acknowledgments; Table of Contents; List of Figures; List of Tables
Chapter 1 Introduction (1.1 Research Background; 1.2 Research Motivation and Objectives; 1.3 Thesis Organization)
Chapter 2 Literature Review (2.1 Person Diarization; 2.2 Face Detection and Alignment; 2.3 Scene Change Detection; 2.4 Face Clustering)
Chapter 3 Algorithm Design and System Implementation (3.1 System Workflow; 3.2 Face Image Extraction: 3.2.1 Face Tracking, 3.2.2 Scene Change Detection; 3.3 Face Clustering: 3.3.1 Obtaining Face Feature Vectors, 3.3.2 The DBSCAN Algorithm; 3.4 Speaker Clustering: 3.4.1 Obtaining Voice Feature Vectors, 3.4.2 The UIS-RNN Algorithm; 3.5 Bipartite Matching: 3.5.1 Extraction of Single-Character Video Segments, 3.5.2 Bipartite Matching of Face and Speaker Clusters)
Chapter 4 Results and Evaluation (4.1 System Environment; 4.2 The Video Visualization System: 4.2.1 Interface Description; 4.3 Dataset; 4.4 Experimental Results: 4.4.1 General Results, 4.4.2 Transition Scenes, 4.4.3 A Counterexample of Bipartite Matching; 4.5 Evaluation: 4.5.1 Conversion of Predicted Cluster IDs, 4.5.2 Definition of Ground Truth, 4.5.3 Comparison of Baseline and Proposed Methods, 4.5.4 Execution Time)
Chapter 5 Conclusions and Future Work
References
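For the speaker clustering step listed in Section 3.4, the google/uis-rnn reference implementation (on which the taylorlu/Speaker-Diarization code cited as [16] builds) exposes roughly the interface sketched below; the embedding files and shapes are hypothetical, and this is only an assumed outline of how the library is typically driven, not the thesis's configuration.

```python
# Rough sketch of speaker clustering with the google/uis-rnn package;
# voice embeddings are assumed to be precomputed (hypothetical files).
import numpy as np
import uisrnn

# Default model, training, and inference settings from the library.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)

# Training: a (T, d) sequence of voice embeddings for one recording,
# with a length-T array of ground-truth speaker ids.
train_sequence = np.load("train_embeddings.npy")
train_cluster_id = np.load("train_speaker_ids.npy")
model.fit(train_sequence, train_cluster_id, training_args)

# Inference: UIS-RNN assigns a speaker-cluster id to every segment
# without requiring the number of speakers in advance.
test_sequence = np.load("test_embeddings.npy")
predicted_ids = model.predict(test_sequence, inference_args)
```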

[1] Elie Khoury, Christine Senac, and Philippe Joly. Audiovisual diarization of people in video content. Multimedia Tools and Applications, 2014.
[2] Giulia Garau, Alfred Dielmann, and Hervé Bourlard. Audio-visual synchronisation for speaker diarisation. 2010.
[3] Hervé Bredin and Grégory Gelly. Improving speaker diarization of TV series using talking-face detection and clustering. pages 157–161, 2016.
[4] I. D. Gebru, S. Ba, X. Li, and R. Horaud. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1086–1099, 2018.
[5] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[6] Bindu Reddy and Anita Jadhav. Comparison of scene change detection algorithms for videos. 2015 Fifth International Conference on Advanced Computing & Communication Technologies, pages 84–89, 2015.
[7] Makarand Tapaswi, Marc T. Law, and Sanja Fidler. Video face clustering with unknown number of clusters. ICCV, 2019.
[8] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. pages 3645–3649, 2017.
[9] MTCNN face detector. https://github.com/ipazc/mtcnn.
[10] PySceneDetect. https://github.com/Breakthrough/PySceneDetect.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.
[12] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level aggregation for speaker recognition in the wild. pages 5791–5795, 2019.
[13] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep speaker recognition. pages 1086–1090, 2018.
[14] Yujie Zhong, Relja Arandjelovic, and Andrew Zisserman. GhostVLAD for set-based face recognition. 2018.
[15] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang. Fully supervised speaker diarization. pages 6301–6305, 2019.
[16] Speaker diarization. https://github.com/taylorlu/Speaker-Diarization.
[17] Imran Sheikh, Rupayan Chakraborty, and Sunil Kumar Kopparapu. Audio-visual fusion for sentiment classification using cross-modal autoencoder. 2018.
[18] Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang, and Mark Plumbley. DCASE 2018 challenge Surrey cross-task convolutional neural network baseline. 2018.

Full-Text Release Date: 2025/08/27 (campus network)
Full-Text Release Date: not authorized for public access (off-campus network)
Full-Text Release Date: not authorized for public access (National Central Library: Taiwan NDLTD system)