| Graduate Student: | 蔡竣泓 Chun-Hung Tsai |
|---|---|
| Thesis Title: | 解纏臉形學習之大角度人臉重演 (Disentangled Shape Learning for Large-Pose Face Reenactment) |
| Advisor: | 徐繼聖 Gee-Sern Hsu |
| Committee Members: | 莊永裕 Yung-Yu Chuang, 林嘉文 Chia-Wen Lin, 陳祝嵩 Chu-Song Chen, 郭景明 Jing-Ming Guo, 徐繼聖 Gee-Sern Hsu |
| Degree: | Master |
| Department: | College of Engineering - Department of Mechanical Engineering |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 (2021-2022) |
| Language: | Chinese |
| Number of Pages: | 60 |
| Keywords (Chinese): | 深度學習 (deep learning)、人臉重演 (face reenactment) |
| Keywords (English): | Deep Learning, Face Reenactment |
We propose the Disentangled Shape Learning (DSL) network for large-pose face reenactment. The DSL consists of two modules: the Disentangled Shape Encoder (DSE) and the Target Face Generator (TFG). Given a reference face, the DSE first generates a Projected Normalized Coordinate Code (PNCC) to disentangle the expression and identity of the reference, and concatenates the PNCC with the reference's 3D facial landmarks to form its output code. The TFG takes the DSE output code together with the encoding of the source face and generates the target reenacted face. Because the expression and identity of the reference are disentangled, the generated target face exhibits the pose and expression of the reference while better preserving the identity of the source. We evaluate the proposed approach on the Multi-PIE, VoxCeleb1, and VoxCeleb2 benchmarks, compare it with state-of-the-art approaches, and show that it is highly competitive. To better demonstrate large-pose reenactment performance, we design evaluation protocols that consider only the large-pose data in these benchmarks.
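As a reading aid, the PyTorch sketch below illustrates the two-module structure described in the abstract: a DSE-like encoder that outputs a PNCC-style map concatenated with 3D landmarks, and a TFG-like generator that combines this shape code with a source-identity code. Every layer size, the 68-landmark count, the identity-code dimension, and the toy CNNs are illustrative assumptions rather than the thesis implementation; in particular, the real DSE obtains the PNCC by fitting a 3D face model, which is only stubbed here.

```python
# Minimal structural sketch of the DSL pipeline described in the abstract.
# All layer sizes, the landmark count, and the way codes are broadcast and
# concatenated are assumptions for illustration, not the thesis design.
import torch
import torch.nn as nn


class DisentangledShapeEncoder(nn.Module):
    """Maps a reference face to a PNCC-like map plus 3D landmarks (hypothetical layout)."""

    def __init__(self, num_landmarks: int = 68):
        super().__init__()
        self.backbone = nn.Sequential(  # toy CNN standing in for the real encoder / 3DMM fitting
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_pncc = nn.Conv2d(64, 3, 3, padding=1)  # 3-channel PNCC-style map
        self.to_landmarks = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_landmarks * 3)
        )

    def forward(self, reference: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(reference)
        pncc = self.to_pncc(feat)           # pose/expression code with identity factored out
        lms = self.to_landmarks(feat)       # flattened (x, y, z) landmark coordinates
        # Broadcast the landmarks spatially and concatenate them with the PNCC map,
        # mirroring "PNCC concatenated with 3D landmarks" from the abstract.
        lms_map = lms[:, :, None, None].expand(-1, -1, *pncc.shape[-2:])
        return torch.cat([pncc, lms_map], dim=1)


class TargetFaceGenerator(nn.Module):
    """Combines the DSE shape code with a source-identity code to synthesize the target face."""

    def __init__(self, shape_channels: int, id_dim: int = 256):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(shape_channels + id_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, shape_code: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        id_map = source_id[:, :, None, None].expand(-1, -1, *shape_code.shape[-2:])
        return self.decode(torch.cat([shape_code, id_map], dim=1))


if __name__ == "__main__":
    dse = DisentangledShapeEncoder()
    tfg = TargetFaceGenerator(shape_channels=3 + 68 * 3)
    reference = torch.randn(1, 3, 128, 128)  # drives pose and expression
    source_id = torch.randn(1, 256)          # identity code of the source face
    reenacted = tfg(dse(reference), source_id)
    print(reenacted.shape)                   # torch.Size([1, 3, 128, 128])
```

Running the sketch produces a 3x128x128 reenacted-face tensor; in the actual system the shape code would come from a fitted 3D face model rather than a toy CNN, but the data flow from reference shape code plus source identity to generated target face is the same.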