
Author: Hung-Yi Wu (吳泓毅)
Thesis title: Landmark-Oriented and Identity-Preserving Face Reenactment (以人臉地標點進行身分保留之臉部重現)
Advisor: Gee-Sern Jison Hsu (徐繼聖)
Committee: Yi-Ping Hung (洪一平), Yung-Yu Chuang (莊永裕), Wen-Huang Cheng (鄭文皇), Sheng-Luen Chung (鍾聖倫), Gee-Sern Jison Hsu (徐繼聖)
Degree: Master
Department: College of Engineering - Department of Mechanical Engineering
Year of publication: 2021
Academic year of graduation: 109
Language: Chinese
Number of pages: 62
Chinese keywords: 臉部重現、臉部生成、人臉地標點
English keywords: Face Reenactment, Face Generation, Facial Landmark

We propose the Landmark-Oriented Identity-Preserving (LOIP) network for the face reenactment task. Given a source face and a reference face as input, LOIP generates a face image that has the same pose and expression as the reference face while preserving the identity of the source face. LOIP consists of two main modules: 1. the ID-preserving Landmark Converter (IDLC); 2. the Face Reenactment Generator (FRG). The IDLC first encodes the facial landmarks of the reference face into a pose/expression code with a landmark encoder, encodes the source face image into an identity code with an identity encoder, and then feeds the concatenation of the two codes into a landmark decoder, which decodes them into the target landmarks. The FRG feeds the source face image into a style-extraction expert network to extract a style code, then feeds the target landmarks produced by the IDLC together with the style code into a face generator, which finally produces a target face with the expression and pose of the reference face and the identity of the source face. The proposed method is highly competitive on both the RaFD and VoxCeleb1 databases.


We propose the Landmark-Oriented Identity-Preserving (LOIP) network for face reenactment. Given a source face and a reference face as input, the LOIP network generates an output face that has the same pose and expression as the reference face and the same identity as the source face. The proposed LOIP network is composed of two major modules, the ID-preserving Landmark Converter (IDLC) and the Face Reenactment Generator (FRG). The IDLC encodes the landmarks of the reference face with a landmark encoder, encodes the source face with a face expert, and decodes the concatenated landmark and face codes into a set of target landmarks that exhibits the pose and expression of the reference face while preserving the identity of the source face. The decoder in the IDLC is trained together with a landmark discriminator and a landmark-based subject classifier. The FRG is built on the StarGAN v2 generator, with a modified input and an added facial style expert. Given the target landmarks produced by the IDLC and the source face as input, the FRG generates the target face with the desired identity, pose, and expression. Evaluated on the RaFD and VoxCeleb1 benchmarks, the proposed framework outperforms state-of-the-art approaches.
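The two-stage data flow described above can be sketched as follows. This is only an illustrative stand-in, not the thesis's implementation: the learned networks (landmark encoder, face expert, landmark decoder, style expert, generator) are replaced by fixed random linear maps, and all dimensions (128-d pose and identity codes, 64-d style code, 68 landmarks, 32x32 output) are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the thesis does not specify these values here.
N_LANDMARKS = 68                       # common 2D facial landmark count
POSE_DIM, ID_DIM, STYLE_DIM = 128, 128, 64

def landmark_encoder(ref_landmarks):
    """Encode reference-face landmarks into a pose/expression code."""
    W = rng.standard_normal((POSE_DIM, N_LANDMARKS * 2))
    return W @ ref_landmarks.reshape(-1)

def identity_encoder(source_face):
    """Encode the source face image into an identity code (stand-in for the face expert)."""
    W = rng.standard_normal((ID_DIM, source_face.size))
    return W @ source_face.reshape(-1)

def landmark_decoder(pose_code, id_code):
    """Decode the concatenated codes into target landmarks (the IDLC output)."""
    z = np.concatenate([pose_code, id_code])
    W = rng.standard_normal((N_LANDMARKS * 2, z.size))
    return (W @ z).reshape(N_LANDMARKS, 2)

def style_expert(source_face):
    """Extract a style code from the source face (stand-in for the style expert)."""
    W = rng.standard_normal((STYLE_DIM, source_face.size))
    return W @ source_face.reshape(-1)

def face_generator(target_landmarks, style_code, out_hw=(32, 32)):
    """Map target landmarks plus style code to an output face image."""
    z = np.concatenate([target_landmarks.reshape(-1), style_code])
    W = rng.standard_normal((out_hw[0] * out_hw[1] * 3, z.size))
    return (W @ z).reshape(out_hw[0], out_hw[1], 3)

def loip_forward(source_face, ref_landmarks):
    # IDLC: reference landmarks -> pose/expression code; source face -> identity code
    pose_code = landmark_encoder(ref_landmarks)
    id_code = identity_encoder(source_face)
    target_lm = landmark_decoder(pose_code, id_code)
    # FRG: target landmarks + source-face style code -> reenacted face
    style = style_expert(source_face)
    return face_generator(target_lm, style)

source = rng.standard_normal((32, 32, 3))
ref_lm = rng.standard_normal((N_LANDMARKS, 2))
out = loip_forward(source, ref_lm)
print(out.shape)  # (32, 32, 3)
```

The point of the sketch is the wiring, not the weights: the identity code enters the landmark decoder (so the target landmarks already carry the source identity), while the style code bypasses the IDLC and conditions the generator directly.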

Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgments
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Method Overview
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Related Work
  2.1 StarGAN v2
  2.2 ReenactGAN
  2.3 X2Face
  2.4 pix2pixHD
  2.5 Few-shot Talking Head
  2.6 FReeNet
Chapter 3 Proposed Method
  3.1 Overall Network Architecture
  3.2 Design of the ID-Preserving Landmark Converter
  3.3 Design of the Face Reenactment Generator
Chapter 4 Experimental Setup and Analysis
  4.1 Databases
    4.1.1 Radboud Faces Database
    4.1.2 VoxCeleb1
    4.1.3 Multi-PIE
  4.2 Experimental Setup
    4.2.1 Data Partitioning and Settings
    4.2.2 Performance Metrics
    4.2.3 Experimental Design
  4.3 Results and Analysis
    4.3.1 Comparison of IDLC Settings
    4.3.2 Comparison of FRG Settings
    4.3.3 Influence of the IDLC on Generated Images
    4.3.4 Large-Pose Face Reenactment
  4.4 Comparison with Related Work
Chapter 5 Conclusion and Future Work
Chapter 6 References

[1] Langner, Oliver, et al. "Presentation and validation of the Radboud Faces Database." Cognition and emotion 24.8 (2010): 1377-1388.
[2] Nagrani, Arsha, Joon Son Chung, and Andrew Zisserman. "Voxceleb: a large-scale speaker identification dataset." arXiv preprint arXiv:1706.08612 (2017).
[3] Zhang, Jiangning, et al. "FReeNet: Multi-Identity Face Reenactment." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[4] Zakharov, Egor, et al. "Few-shot adversarial learning of realistic neural talking head models." Proceedings of the IEEE International Conference on Computer Vision. 2019.
[5] Bulat, Adrian, and Georgios Tzimiropoulos. "How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[6] Dowson, D. C., and B. V. Landau. "The Fréchet distance between multivariate normal distributions." Journal of multivariate analysis 12.3 (1982): 450-455.
[7] Wang, Zhou, et al. "Image quality assessment: from error visibility to structural similarity." IEEE transactions on image processing 13.4 (2004): 600-612.
[8] Deng, Jiankang, et al. "Arcface: Additive angular margin loss for deep face recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[9] Choi, Yunjey, et al. "Stargan v2: Diverse image synthesis for multiple domains." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[10] Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." (2015).
[11] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." Proc. ICML. Vol. 30. No. 1. 2013.
[12] Bulat, Adrian, and Georgios Tzimiropoulos. "Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[13] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European conference on computer vision. Springer, Cham, 2016.
[14] Choi, Yunjey, et al. "Stargan v2: Diverse image synthesis for multiple domains." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[15] Huang, Xun, and Serge Belongie. "Arbitrary style transfer in real-time with adaptive instance normalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[16] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2019.
[17] Garrido, Pablo, et al. "Automatic face reenactment." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
[18] Tripathy, Soumya, Juho Kannala, and Esa Rahtu. "Icface: Interpretable and controllable face reenactment using gans." The IEEE Winter Conference on Applications of Computer Vision. 2020.
[19] Zhang, Yunxuan, et al. "One-shot face reenactment." arXiv preprint arXiv:1908.03251 (2019).
[20] Bregler, Christoph, Michele Covell, and Malcolm Slaney. "Video rewrite: Driving visual speech with audio." Proceedings of the 24th annual conference on Computer graphics and interactive techniques. 1997.
[21] Kim, Hyeongwoo, et al. "Deep video portraits." ACM Transactions on Graphics (TOG) 37.4 (2018): 1-14.
[22] Thies, Justus, et al. "Face2face: Real-time face capture and reenactment of rgb videos." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[23] Zakharov, Egor, et al. "Few-shot adversarial learning of realistic neural talking head models." Proceedings of the IEEE International Conference on Computer Vision. 2019.
[24] Zhang, Jiangning, et al. "FReeNet: Multi-Identity Face Reenactment." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[25] Choi, Yunjey, et al. "Stargan: Unified generative adversarial networks for multi-domain image-to-image translation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[26] Karras, Tero, et al. "Progressive growing of gans for improved quality, stability, and variation." arXiv preprint arXiv:1710.10196 (2017).
[27] Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[28] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE international conference on computer vision. 2017.
[29] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[30] Wu, Wayne, et al. "Look at boundary: A boundary-aware face alignment algorithm." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[31] Wu, Wayne, et al. "Reenactgan: Learning to reenact faces via boundary transfer." Proceedings of the European conference on computer vision (ECCV). 2018.
[32] Wiles, Olivia, A. Sophia Koepke, and Andrew Zisserman. "X2face: A network for controlling face generation using images, audio, and pose codes." Proceedings of the European conference on computer vision (ECCV). 2018.
[33] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
[34] Wang, Ting-Chun, et al. "High-resolution image synthesis and semantic manipulation with conditional gans." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[35] Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[36] Silberman, Nathan, et al. "Indoor segmentation and support inference from rgbd images." European conference on computer vision. Springer, Berlin, Heidelberg, 2012.
[37] Zhou, Bolei, et al. "Scene parsing through ade20k dataset." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[38] Le, Vuong, et al. "Interactive facial feature localization." European conference on computer vision. Springer, Berlin, Heidelberg, 2012.
[39] Smith, Brandon M., et al. "Exemplar-based face parsing." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
[40] Gross, Ralph, et al. "Multi-pie." Image and Vision Computing 28.5 (2010): 807-813.
[41] HyperLandmark. https://github.com/zeusees/HyperLandmark
[42] Chung, Joon Son, Arsha Nagrani, and Andrew Zisserman. "Voxceleb2: Deep speaker recognition." arXiv preprint arXiv:1806.05622 (2018).
[43] Rossler, Andreas, et al. "Faceforensics++: Learning to detect manipulated facial images." Proceedings of the IEEE International Conference on Computer Vision. 2019.

Full text will be released on 2026/02/04 (campus network, off-campus network, and National Central Library: Taiwan NDLTD system).