
Graduate Student: Kai-Yu Zhang (張凱渝)
Thesis Title: A 3D Face Talking Video Generation System Based on Generative Adversarial Networks (一個基於生成對抗式網路的3D人臉說話影片生成系統)
Advisor: Chin-Shyurng Fahn (范欽雄)
Committee Members: Shaou-Gang Miaou (繆紹綱), Jung-Hua Wang (王榮華), Huei-Wen Ferng (馮輝文), Chin-Shyurng Fahn (范欽雄)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2023
Graduation Academic Year: 112
Language: English
Number of Pages: 51
Keywords: generative adversarial network, 3D face model, lip synchronization, UV texture generation, texture mapping, texture rendering, non-linear 3DMM, multimodal deep learning
    In research on speech processing, image processing, and face recognition, lip synchronization is a key technique: it provides the lip-movement information of a 2D face, which is essential in applications ranging from visual effects production to augmented reality, all of which require accurate lip synchronization to keep the audio and the visuals consistent. In the past, lip synchronization relied on rule-based and other traditional methods that align the timing of the audio and the video. These methods usually capture the relationship between speech and lip movements with hand-crafted rules or models and therefore face inherent limitations. In this thesis, we instead apply a deep learning model to lip synchronization, aiming to improve how the video and the audio are synchronized, and we use texture mapping and rendering to convert the 2D face talking video into a 3D face talking video.
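    To illustrate the texture-mapping step mentioned above, the following minimal NumPy sketch looks up a color for every vertex of a 3D face mesh from a 2D UV texture image. The mesh size, UV layout, and texture here are placeholder assumptions for illustration, not the thesis's actual 3DMM data or rendering pipeline.

```python
# Minimal sketch of the texture-mapping idea: given per-vertex UV coordinates of a
# 3D face mesh and a 2D face texture image, sample a color for every vertex.
# All data below are dummy placeholders, not the thesis's 3DMM assets.
import numpy as np

def sample_vertex_colors(texture: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """texture: (H, W, 3) image; uv: (N, 2) coordinates in [0, 1]; returns (N, 3) colors."""
    h, w = texture.shape[:2]
    cols = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)  # v axis is flipped
    return texture[rows, cols]

texture = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # dummy UV texture image
uv = np.random.rand(5000, 2)                                        # dummy per-vertex UV coordinates
colors = sample_vertex_colors(texture, uv)                          # (5000, 3) per-vertex colors
```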
    Traditional face talking video generation models based on generative adversarial networks usually need to deal with complicated data flows, including a large amount of video and the corresponding audio. We propose a lip-synchronization generation model built on the Wav2Lip architecture, which combines a generator with a critic based on the WGAN formulation. The generator produces lip-synchronized frames from the input face data and speech data, while the critic judges whether its input frames are real by comparing the generated lip-synchronized frames with real ones. Through the competition between the generator and the critic, the model is continuously adjusted and optimized to produce videos whose lip movements are realistic and consistent with the input.
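    To make the generator and critic interplay concrete, the following minimal PyTorch sketch shows one training step of a WGAN-style critic paired with an audio-conditioned generator. The tiny modules, tensor shapes, loss weights, and optimizer settings are illustrative assumptions, not the thesis's actual Wav2Lip-based implementation.

```python
# A minimal sketch of the generator/critic objectives described above, assuming a
# Wav2Lip-style generator G(face, mel) -> lip-synced frames and a WGAN critic
# D(frames) -> realism score. All modules and shapes are illustrative placeholders.
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Stand-in for the Wav2Lip-style generator: fuses face and audio features."""
    def __init__(self):
        super().__init__()
        self.face_enc = nn.Conv2d(3, 16, 3, padding=1)
        self.audio_enc = nn.Linear(80, 16)           # 80-dim mel frame (assumed)
        self.dec = nn.Conv2d(16, 3, 3, padding=1)
    def forward(self, face, mel):
        f = torch.relu(self.face_enc(face))                   # (B, 16, H, W)
        a = self.audio_enc(mel).unsqueeze(-1).unsqueeze(-1)   # (B, 16, 1, 1)
        return torch.sigmoid(self.dec(f + a))                 # generated frames

class TinyCritic(nn.Module):
    """Stand-in for the visual-quality critic: outputs an unbounded realism score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1),
                                 nn.LeakyReLU(0.2),
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(),
                                 nn.Linear(16, 1))
    def forward(self, x):
        return self.net(x)

G, D = TinyGenerator(), TinyCritic()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)

faces = torch.rand(4, 3, 96, 96)   # reference face frames (dummy batch)
mel = torch.rand(4, 80)            # one mel-spectrogram frame per sample (dummy)
target = faces                     # ground-truth frames; identical here only for the sketch

# Critic step: Wasserstein objective, push D(real) above D(fake).
fake = G(faces, mel).detach()
d_loss = D(fake).mean() - D(target).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
for p in D.parameters():           # weight clipping as in the original WGAN formulation
    p.data.clamp_(-0.01, 0.01)

# Generator step: fool the critic while staying close to the target frames.
fake = G(faces, mel)
g_loss = -D(fake).mean() + nn.functional.l1_loss(fake, target)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```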
    In the experiments, we compare our model with Speech2Vid, LipGAN, Wav2Lip, and Wav2Lip + GANs. The results show that our model performs well and, compared with the other four models, generates more realistic lip-motion videos. We train and test lip synchronization on the LRW and LRS2 datasets, respectively; the mean squared error of our model is 33.40/31.03, which is lower than the 36.09/35.96 obtained by Wav2Lip + GANs, the 54.29/56.16 obtained by Wav2Lip, the 59.68/59.33 obtained by LipGAN, and the 63.12/62.41 obtained by Speech2Vid.
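    For clarity, the metric reported above is mean squared error; a minimal sketch is given below, assuming the error is the average squared pixel difference between generated and ground-truth frames (the frame layout and value range are assumptions, not details stated in the abstract).

```python
# Minimal sketch of a mean-squared-error metric between generated and reference
# videos, averaged over all frames and pixels. Shapes and value range are assumed.
import numpy as np

def video_mse(generated: np.ndarray, reference: np.ndarray) -> float:
    """generated, reference: (num_frames, H, W, 3) arrays with values in [0, 255]."""
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    return float(np.mean(diff ** 2))
```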

    Contents
    中文摘要 (Chinese Abstract)
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Overview
      1.2 Motivation
      1.3 System Description
      1.4 Thesis Organization
    Chapter 2 Related Work
      2.1 UV Face Texture Generation
      2.2 UV Mapping
      2.3 3D Face Talking Generation Method Based on Deep Learning
        2.3.1 Multimodal deep learning
        2.3.2 Generative adversarial networks
        2.3.3 3D face modeling
    Chapter 3 Our 3D Face Talking Video Generation Model
      3.1 Data Preprocessing
      3.2 Face Talking Video Generation Model
        3.2.1 Generator
        3.2.2 Visual quality critic
        3.2.3 WGANs loss function
    Chapter 4 Experimental Results and Discussion
      4.1 Experimental Environment Setup
      4.2 Dataset of Lip-Reading
        4.2.1 The Oxford-BBC Lip Reading in the Wild (LRW) Dataset
        4.2.2 The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset
      4.3 Results of Talking Face Video Generation
        4.3.1 Evaluation metrics
        4.3.2 Comparison of our model and the others
        4.3.3 Ablation study on the sync loss
        4.3.4 Ablation study on increasing input features
      4.4 Results of 3D Face Talking Video Generation
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Work
    References


    Full text available from 2029/01/24 (campus network)
    Full text available from 2034/01/24 (off-campus network)
    Full text available from 2034/01/24 (National Central Library: Taiwan NDLTD system)