
Graduate Student: Chi-Jui Yu (余啓睿)
Thesis Title: High-Resolution Face Swapping using StyleGAN (基於風格生成對抗網路之高解析度換臉系統)
Advisor: Nai-Jian Wang (王乃堅)
Committee Members: Shun-Feng Su (蘇順豐), Shun-Ping Chung (鍾順平), Shyue-Kung Lu (呂學坤), Jing-Ming Guo (郭景明), Nai-Jian Wang (王乃堅)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 65
Keywords (Chinese): face swapping, artificial neural networks, generative adversarial networks, autoencoders, deep learning
Keywords (English): Face swapping, Artificial neural networks, Generative adversarial networks, Autoencoders, Deep learning
Access Count: 742 views, 0 downloads

Face swapping is a technique that changes the identity of a person in an image by replacing the face. It can be applied in the film, advertising, education, and social media industries to save shooting time and cost, recreate historical figures, or provide immersive experiences.
The high-resolution face swapping system based on StyleGAN proposed in this thesis combines the swapping autoencoder architecture with a disentangled-feature training method, so that the encoder can extract an expression code and an identity code from a face image. A mapping network then converts any combination of codes from arbitrary face images into generation conditions that steer the decoder to synthesize a specific image. To make these conditions easier for the decoder to follow, this thesis adopts a pre-trained StyleGAN as the decoder, since its latent space has been shown to support interpretable manipulation, a property that helps the system generate the intended image more accurately. Furthermore, this thesis analyzes and exploits StyleGAN's view of an image as a collection of styles, applying style mixing during decoding to improve image realism. Finally, the Poisson blending algorithm is applied for seamless fusion, yielding a more convincing face-swapping result.
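The style-mixing idea mentioned above can be pictured with a short sketch. This is a minimal illustration, assuming a StyleGAN generator whose synthesis network consumes one 512-dimensional style vector per layer (18 layers at 1024x1024 output); the crossover layer and all names below are hypothetical, not the configuration used in the thesis.

    import torch

    NUM_LAYERS = 18   # StyleGAN synthesis layers at 1024x1024 output
    CROSSOVER = 8     # hypothetical layer where the style source switches

    def style_mix(w_coarse: torch.Tensor, w_fine: torch.Tensor,
                  crossover: int = CROSSOVER) -> torch.Tensor:
        # w_coarse, w_fine: [batch, NUM_LAYERS, 512] per-layer style vectors.
        # Early (coarse) layers mostly control pose and face shape, while
        # later (fine) layers mostly control texture and color, so taking
        # early layers from one source and late layers from another
        # composes one image from two sets of styles.
        w_mixed = w_coarse.clone()
        w_mixed[:, crossover:] = w_fine[:, crossover:]
        return w_mixed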
Experimental results show that the system can be trained on the open FFHQ dataset and successfully produce realistic high-resolution (1024x1024) face swaps on the CelebA-HQ dataset. It can also be applied directly to any pair of face images without retraining the internal neural networks, which improves its practicality. In addition, the system's performance is validated with image quality assessment metrics, achieving scores of 0.93 on SSIM and 11.67 on FID.


Face swapping replaces the identity of a person in an image by modifying the facial texture. This technology can be applied to the film, advertising, education, and social media industries to save time and money in filming, recreate historical figures, or provide an immersive experience.
We propose a high-resolution (1024x1024) face swapping system that combines the swapping autoencoder with a disentanglement learning framework, so that the encoder can embed a face image into an expression code and an identity code in a disentangled way. We then train a mapping network that directly maps any swapped combination of codes to a series of style vectors, which control a pre-trained StyleGAN generator. This mapping allows the generator to produce a realistic face image matching the input information. In addition, we further analyze and apply style mixing in the decoding process to improve the fidelity of the swapped face. Finally, we incorporate the Poisson blending algorithm to blend the two faces seamlessly and obtain a more realistic face image.
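As a rough end-to-end illustration of this pipeline, the sketch below assumes an encoder E, a mapping network M, and a pre-trained StyleGAN generator G with the interfaces shown; every module name, shape, and code layout is an assumption for illustration rather than the thesis implementation. The blending step uses OpenCV's seamlessClone, an implementation of Poisson image editing (Pérez et al., 2003).

    import cv2
    import torch

    def swap_faces(E, M, G, img_target, img_source):
        # 1. Encode: expression/pose code from the target image and
        #    identity code from the source image (disentangled codes).
        expr_code, _ = E(img_target)
        _, id_code = E(img_source)

        # 2. Map the swapped code combination to per-layer style vectors.
        styles = M(torch.cat([expr_code, id_code], dim=1))  # [B, 18, 512]

        # 3. Decode with the pre-trained StyleGAN generator.
        return G(styles)  # [B, 3, 1024, 1024]

    def blend(swapped_face, target_img, face_mask):
        # Poisson blending for a seamless composite. Inputs must be
        # uint8 BGR images (numpy arrays); face_mask marks the region
        # of swapped_face to clone into target_img.
        h, w = target_img.shape[:2]
        center = (w // 2, h // 2)
        return cv2.seamlessClone(swapped_face, target_img, face_mask,
                                 center, cv2.NORMAL_CLONE)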
Experimental results show that our high-resolution face swapping system can be trained on the FFHQ dataset and evaluated on the CelebA-HQ dataset. Crucially, the system can be applied to any pair of faces without retraining, which enables more practical scenarios. Our system achieves 0.93 on the SSIM metric and 11.67 on the FID metric.
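For reference, both reported metrics can be computed with common open-source implementations; the sketch below uses scikit-image for SSIM and torchmetrics for FID as plausible stand-ins, which may differ from the exact implementations used in the thesis.

    import torch
    from skimage.metrics import structural_similarity
    from torchmetrics.image.fid import FrechetInceptionDistance

    def ssim_score(img_a, img_b):
        # img_a, img_b: uint8 RGB arrays of shape (H, W, 3).
        return structural_similarity(img_a, img_b, channel_axis=2)

    def fid_score(real_batch, fake_batch):
        # real_batch, fake_batch: uint8 image tensors of shape [N, 3, H, W].
        fid = FrechetInceptionDistance(feature=2048)
        fid.update(real_batch, real=True)
        fid.update(fake_batch, real=False)
        return fid.compute()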

Abstract (Chinese) I
Abstract (English) II
Acknowledgments III
Table of Contents IV
List of Figures VI
List of Tables VIII
Chapter 1 Introduction 1
  1.1 Research Background and Motivation 1
  1.2 Literature Review 3
  1.3 Thesis Objectives 5
  1.4 Thesis Organization 6
Chapter 2 Background 7
  2.1 Artificial Neural Network (ANN) 7
  2.2 Convolutional Neural Network (CNN) 10
    2.2.1 LeNet 11
    2.2.2 AlexNet 11
    2.2.3 ResNet 12
  2.3 Autoencoder 14
    2.3.1 Swapping Autoencoder 15
  2.4 Generative Adversarial Network (GAN) 16
    2.4.1 PG-GAN 17
    2.4.2 StyleGAN 18
Chapter 3 High-Resolution Face Swapping System Based on StyleGAN 19
  3.1 Face Reenactment 21
    3.1.1 Encoder 22
    3.1.2 Decoder 23
    3.1.3 Loss Function 26
  3.2 Face Merging 29
  3.3 Training and Parameter Settings 31
Chapter 4 Experimental Results and Analysis 32
  4.1 Experimental Environment 32
  4.2 Image Quality Assessment Metrics 32
  4.3 Datasets 35
  4.4 Experimental Results 35
    4.4.1 Style Mixing Analysis 36
    4.4.2 Network Architecture Analysis 37
    4.4.3 Reconstruction Loss Analysis 38
    4.4.4 Successful Face Swapping Results 39
    4.4.5 Failed Face Swapping Results 44
Chapter 5 Conclusion and Future Research Directions 46
  5.1 Conclusion 46
  5.2 Future Research Directions 47
References 48


Full Text Release Date: 2026/07/26 (campus network)
Full Text Release Date: 2026/07/26 (off-campus network)
Full Text Release Date: 2026/07/26 (National Central Library: Taiwan NDLTD system)