
Student: Jhih-Syuan Liu (劉至軒)
Thesis title: A 3D Face Reconstruction System Based on a Transformer Hybrid Model (一個基於Transformer混合模型的3D人臉重建系統)
Advisor: Chin-Shyurng Fahn (范欽雄)
Committee members: 繆紹綱, 王榮華, 馮輝文, 范欽雄
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication year: 2024
Academic year: 112
Language: English
Pages: 55
Keywords (Chinese): 3D臉部重建, 人臉對齊, 人臉偵測, Transformer模型, FLAME人臉模型
Keywords (English): 3D face reconstruction, face alignment, face detection, Transformer model, FLAME model


    Research on face reconstruction has long attracted significant attention, evolving from the early parameter-optimization methods based on the 3DMM to the rapid advances of deep learning in recent years. Deep-learning-based face reconstruction has become increasingly prominent, with a wide range of applications such as VR/AR, facial image editing, face recognition, and virtual facial makeup. Within deep learning, many studies have used CNNs as the primary model for face reconstruction tasks, achieving good results and accurately reconstructing 3D faces from images.
    In recent years, with the rise of the Transformer, research in many fields has explored replacing CNN models with Transformer models to achieve better outcomes. This thesis explores the use of a Transformer hybrid model in the field of face reconstruction to reconstruct more accurate 3D faces from images.
    Current weakly supervised face reconstruction methods use a CNN to extract image features, regress face-related parameters from a one-dimensional feature vector, and then reconstruct the 3D face with a statistical face model. Relying solely on a one-dimensional feature, however, discards much important information. This thesis proposes a Transformer hybrid model that combines multi-layer patch tokens with multiple class tokens to learn richer features, giving the model access to more information and allowing it to capture more of the important features during face reconstruction; the FLAME model is then used to reconstruct an accurate face model.
    In the experiments, our model achieved better results than other models under four evaluation methods and metrics: Scan-to-Mesh, Chamfer Distance, Mean Normal Error, and Complete Rate. For visual comparison, we present results on a variety of images, together with heatmaps that verify the accuracy of the generated face models. In the ablation study, we validated our results with respect to the use of multiple class tokens and multiple patch tokens. We confirmed that, with multiple class tokens and multi-layer patch tokens, our approach is effective and achieves more accurate 3D face reconstruction.
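As a rough illustration of this design, the following NumPy sketch taps pooled patch tokens from several encoder layers, carries multiple class tokens, and regresses FLAME-style parameter groups from the fused feature. All dimensions, the number of tapped layers, the toy "layer" itself, and the parameter-group sizes are hypothetical assumptions, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the thesis).
NUM_PATCHES, DIM = 196, 64       # 14x14 patch grid, token width
NUM_CLASS_TOKENS = 4             # multiple class tokens
LAYERS_TO_TAP = 3                # patch tokens are collected from several layers

# FLAME-style parameter groups; the sizes here are illustrative only.
PARAM_SIZES = {"shape": 100, "expression": 50, "pose": 6, "camera": 3}

def toy_layer(tokens, w):
    """A stand-in for one encoder layer: a linear map plus residual."""
    return tokens + np.tanh(tokens @ w)

def encode(image_patches):
    """Run tokens through the encoder, tapping patch tokens at several depths."""
    cls = rng.standard_normal((NUM_CLASS_TOKENS, DIM)) * 0.02
    tokens = np.concatenate([cls, image_patches], axis=0)
    tapped = []
    for _ in range(LAYERS_TO_TAP):
        w = rng.standard_normal((DIM, DIM)) * 0.05
        tokens = toy_layer(tokens, w)
        tapped.append(tokens[NUM_CLASS_TOKENS:].mean(axis=0))  # pooled patch tokens
    # Fuse the multiple class tokens with the multi-layer patch features.
    return np.concatenate([tokens[:NUM_CLASS_TOKENS].reshape(-1),
                           np.concatenate(tapped)])

def regress_flame(feature):
    """Linear heads mapping the fused feature to each parameter group."""
    return {name: feature @ (rng.standard_normal((feature.shape[0], size)) * 0.01)
            for name, size in PARAM_SIZES.items()}

patches = rng.standard_normal((NUM_PATCHES, DIM))
params = regress_flame(encode(patches))
print({k: v.shape for k, v in params.items()})
```

The point of the fusion step is that the regression head sees both global summaries (class tokens) and spatial detail from several depths (patch tokens), rather than a single one-dimensional feature.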
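Two of the geometric metrics above can be stated concretely. A minimal NumPy sketch of a symmetric Chamfer distance and a one-directional scan-to-mesh error follows; it uses brute-force nearest neighbours over small point sets, and the exact normalization used in the thesis may differ:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbour distance taken in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def scan_to_mesh_error(scan_pts, mesh_pts):
    """One-directional error: each ground-truth scan point to its nearest
    point sampled from the reconstructed mesh surface."""
    d = np.linalg.norm(scan_pts[:, None, :] - mesh_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()

rng = np.random.default_rng(1)
scan = rng.standard_normal((500, 3))
mesh = scan + 0.01 * rng.standard_normal((500, 3))  # a near-perfect reconstruction
print(chamfer_distance(scan, mesh), scan_to_mesh_error(scan, mesh))
```

For real meshes the mesh side would be densely sampled from the triangle surface and a spatial index (e.g. a k-d tree) used instead of the dense distance matrix, which is quadratic in the number of points.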

    中文摘要 i
    Abstract ii
    中文致謝 iv
    List of Figures vii
    List of Tables viii
    Chapter 1 Introduction 1
      1.1 Overview 1
      1.2 Motivation 3
      1.3 System Description 4
      1.4 Thesis Organization 5
    Chapter 2 Related Work 6
      2.1 3D Morphable Model 6
        2.1.1 Basel face model 7
        2.1.2 FLAME model 8
      2.2 Face Reconstruction 10
    Chapter 3 Our Face Reconstruction Model 13
      3.1 Data Preprocessing 13
      3.2 3D Face Reconstruction Model 14
        3.2.1 Encoder architecture 15
        3.2.2 Decoder architecture 18
      3.3 Loss Function 19
        3.3.1 Landmarks loss 19
        3.3.2 Eyelid loss and lips loss 19
        3.3.3 Photometric loss 20
        3.3.4 Identity loss 21
        3.3.5 Shape consistency loss 21
        3.3.6 Regularization 23
        3.3.7 Overall loss function 23
    Chapter 4 Experimental Results and Discussion 24
      4.1 Experimental Environment Setup 24
      4.2 Dataset Description 25
        4.2.1 VGGFace2 dataset 26
        4.2.2 BUPT-Balancedface dataset 26
        4.2.3 FaceScape dataset 27
        4.2.4 NoW dataset 28
      4.3 The Results of Our Face Reconstruction 28
        4.3.1 Evaluation metrics 29
        4.3.2 Training on our face reconstruction system 31
        4.3.3 Reconstruction performance comparison of our model and the others 32
        4.3.4 Visual comparison with other models 34
        4.3.5 Bad performance examples 44
      4.4 Ablation Study 46
    Chapter 5 Conclusions and Future Work 49
      5.1 Conclusions 49
      5.2 Future Work 50
    References 52


    Full text available from 2029/07/18 (campus network)
    Full text available from 2034/07/18 (off-campus network)
    Full text available from 2034/07/18 (National Central Library: Taiwan NDLTD system)