
Student: Wei-Jie Hong (洪偉傑)
Thesis Title: Recomposed Shape Attention Learning for One-Shot Large-Pose Face Reenactment (重構臉型關注學習之單張大角度人臉重演)
Advisor: Gee-Sern Hsu (徐繼聖)
Committee Members: Sheng-Luen Chung (鍾聖倫), Chu-Song Chen (陳祝嵩), Huei-Yung Lin (林惠勇)
Degree: Master
Department: College of Engineering - Department of Mechanical Engineering
Year of Publication: 2023
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 61
Chinese Keywords: Face Reenactment (人臉重演), Face Recognition (人臉辨識), 3D Face Model (三維人臉模型)
Foreign Keywords: Face Reenactment, Transformer, FLAME
    We propose Recomposed Shape Attention Learning (RSAL), a model for one-shot
    large-pose face reenactment. Unlike mainstream methods that rely on a GAN as the
    main generator, we introduce a Transformer mechanism to improve identity
    preservation under large pose and expression changes. The RSAL model consists of
    three modules: the Shape Recomposition Encoder (SRE), the Shape Enhanced
    Transformer (SET), and the Attention-Embedded Generator (AEG). First, the SRE
    recomposes the source face's identity and the reference face's action into the
    target shape code. The SET extracts a style code from the source face and the
    shape code. The AEG takes the target shape code and the style code as inputs and
    produces the reenacted face, generating high-quality face images. One advantage of
    our method is its training mechanism, which addresses the instability of related
    methods that mostly rely on fine-tuning with a few samples; our method works from
    a single source face and achieves cross-identity face reenactment. We compare our
    method with related work on the MPIE-LP, VoxCeleb1 [16], and VoxCeleb2-LP
    databases, and the results confirm that it is highly competitive for one-shot
    large-pose face reenactment.


    We propose Recomposed Shape Attention Learning (RSAL) for one-shot
    large-pose face reenactment. Different from previous approaches that rely on a
    GAN for identity preservation during training, we introduce a Transformer
    mechanism to improve identity preservation across large poses. The RSAL model
    consists of three modules, namely the Shape Recomposition Encoder (SRE), the
    Shape Enhanced Transformer (SET), and the Attention-Embedded Generator (AEG).
    Given a source face and a reference face, the SRE generates a target shape code
    that combines the source identity and the reference action. The SET extracts a
    style code from the source face and the shape code. The AEG takes the shape code
    and the style code as inputs to generate the desired reenacted face, and is
    capable of producing high-quality facial images. A favorable property of the
    approach is its training mechanism, which overcomes the weak controllability of
    previous methods that fine-tune on a few images (few-shot); our method uses a
    single source image (one-shot) and enables cross-identity reenactment. We evaluate
    our approach on the MPIE-LP, VoxCeleb1, and VoxCeleb2-LP datasets. Qualitative
    and quantitative results under large poses show that the proposed approach
    produces reenacted faces better than the state of the art.
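
    To make the pipeline above concrete, the sketch below shows one plausible way to
    wire the three modules together in PyTorch. This is a minimal illustration only:
    the class names mirror SRE, SET, and AEG from the abstract, but their internals,
    the 256-dimensional codes, and the 64x64 image size are placeholder assumptions,
    not the thesis implementation.

    # Minimal sketch of the RSAL forward pass (hypothetical layer sizes and module
    # internals; not the author's released code).
    import torch
    import torch.nn as nn

    class ShapeRecompositionEncoder(nn.Module):
        """SRE: recomposes source identity and reference action into a target shape code."""
        def __init__(self, code_dim: int = 256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, code_dim),
            )

        def forward(self, source, reference):
            # Concatenate source (identity) and reference (action) along channels.
            return self.backbone(torch.cat([source, reference], dim=1))

    class ShapeEnhancedTransformer(nn.Module):
        """SET: attends over source-face tokens, queried by the shape code, to yield a style code."""
        def __init__(self, code_dim: int = 256, num_heads: int = 4):
            super().__init__()
            self.to_tokens = nn.Conv2d(3, code_dim, kernel_size=16, stride=16)  # patchify the source
            self.attn = nn.MultiheadAttention(code_dim, num_heads, batch_first=True)
            self.proj = nn.Linear(code_dim, code_dim)

        def forward(self, source, shape_code):
            tokens = self.to_tokens(source).flatten(2).transpose(1, 2)  # (B, N, C) patch tokens
            query = shape_code.unsqueeze(1)                             # shape code acts as the query
            style, _ = self.attn(query, tokens, tokens)
            return self.proj(style.squeeze(1))

    class AttentionEmbeddedGenerator(nn.Module):
        """AEG: decodes the shape code and style code into the reenacted face."""
        def __init__(self, code_dim: int = 256, image_size: int = 64):
            super().__init__()
            self.image_size = image_size
            self.decode = nn.Sequential(
                nn.Linear(code_dim * 2, 3 * image_size * image_size), nn.Tanh(),
            )

        def forward(self, shape_code, style_code):
            out = self.decode(torch.cat([shape_code, style_code], dim=1))
            return out.view(-1, 3, self.image_size, self.image_size)

    class RSAL(nn.Module):
        """Pipeline from the abstract: SRE -> SET -> AEG."""
        def __init__(self):
            super().__init__()
            self.sre = ShapeRecompositionEncoder()
            self.set = ShapeEnhancedTransformer()
            self.aeg = AttentionEmbeddedGenerator()

        def forward(self, source, reference):
            shape_code = self.sre(source, reference)   # identity from source, action from reference
            style_code = self.set(source, shape_code)  # style from the single source image
            return self.aeg(shape_code, style_code)    # reenacted face

    if __name__ == "__main__":
        # One-shot cross-identity reenactment: one source image, one driving (reference) frame.
        source = torch.randn(1, 3, 64, 64)
        reference = torch.randn(1, 3, 64, 64)
        print(RSAL()(source, reference).shape)  # torch.Size([1, 3, 64, 64])

    For a driving video, the same single source image would be paired with each
    reference frame in turn to produce the reenacted sequence.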

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1 Research Background and Motivation
      1.2 Method Overview
      1.3 Contributions
      1.4 Thesis Organization
    Chapter 2  Literature Review
      2.1 FLAME
      2.2 TransEditor
      2.3 Style Transformer for Image Inversion and Editing
      2.4 StyleSwin
      2.5 First Order Motion Model for Image Animation
      2.6 HeadGAN
      2.7 Bi-layer
      2.8 Face2Face
    Chapter 3  Proposed Method
      3.1 Overall Network Architecture
      3.2 Pose-Adaptive Encoder Design
      3.3 Shape Recomposition Encoder Design
      3.4 Shape Enhanced Transformer and Attention-Embedded Generator Design
    Chapter 4  Experimental Setup and Analysis
      4.1 Databases
        4.1.1 Multi-LP
        4.1.2 VoxCeleb1
        4.1.3 VoxCeleb2-LP
      4.2 Experimental Setup
        4.2.1 Data Partitioning and Settings
        4.2.2 Evaluation Metrics
        4.2.3 Experiment Design
      4.3 Experimental Results and Analysis
        4.3.1 Comparison of Shape Enhanced Transformer Settings
        4.3.2 Effect of FLAME Features
        4.3.3 Comparison of Identity Loss Functions
        4.3.4 Study of the Self-Attention Mechanism
        4.3.5 Comparison of the Number of Fine-Tuning Images for Reenactment
      4.4 Comparison with Related Work
    Chapter 5  Conclusion and Future Research Directions
    Chapter 6  References

    [1] Tianye Li, et al. Learning a model of facial shape and expression from 4D
    scans. ACM Trans. Graph., 2017.
    [2] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou.
    Facewarehouse: A 3d facial expression database for visual computing. IEEE
    Transactions on Visualization and Computer Graphics, 20(3):413–425, 2013.
    [3] Jie Cao, Yibo Hu, Hongwen Zhang, Ran He, and Zhenan Sun. Towards high
    fidelity face frontalization in the wild. In IJCV, 2020.
    [4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset
    for recognising faces across pose and age. In FG, 2018.
    [5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and
    Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain
    image-to-image translation. In CVPR, 2018.
    [6] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse
    image synthesis for multiple domains. In CVPR, 2020.
    [7] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep
    speaker recognition. In INTERSPEECH, 2018.
    [8] Deng, Jiankang, et al. "Arcface: Additive angular margin loss for deep face
    recognition." Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition. 2019.
    [9] Michail Christos Doukas, Stefanos Zafeiriou, and Viktoriia Sharmanska.
    Headgan: One-shot neural head synthesis and editing. In IEEE/CVF
    International Conference on Computer Vision (ICCV), 2021.
    [10] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker.
    Multi-pie. Image and Vision Computing, 2010.
    [11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and
    Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a
    local nash equilibrium. In NIPS, 2017.
    [12] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with
    adaptive instance normalization. In ICCV, 2017.
    [13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time
    style transfer and super-resolution. In European conference on computer vision,
    pages 694–711. Springer, 2016.
    [14] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture
    for generative adversarial networks. In CVPR, 2019.
    [15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
    arXiv preprint arXiv:1412.6980, 2014.
    [16] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a
    large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612,
    2017.
    [17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
    Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
    Automatic differentiation in pytorch. 2017.
    [18] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas
    Vetter. A 3d face model for pose and illumination invariant face recognition. In
    2009 sixth IEEE international conference on advanced video and signal based
    surveillance, pages 296–301. IEEE, 2009.
    [19] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and
    Nicu Sebe. First order motion model for image animation. In NIPS, 2019.
    [20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Zbigniew Wojna.
    Rethinking the inception architecture for computer vision. arXiv preprint
    arXiv:1512.00567, 2015.
    [21] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan
    Catanzaro. Few-shot video-to-video synthesis. In Conference on Neural
    Information Processing Systems (NeurIPS), 2019.
    [22] Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. X2face: A network for
    controlling face generation using images, audio, and pose codes. In ECCV, 2018.
    [23] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy.
    Reenactgan: Learning to reenact faces via boundary transfer. In ECCV, 2018.
    [24] Guangming Yao, Yi Yuan, Tianjia Shao, and Kun Zhou. Mesh guided one-shot
    face reenactment using graph convolutional networks. arXiv preprint
    arXiv:2008.07783, 2020.
    [25] Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, and Victor
    Lempitsky. Fast bi-layer neural synthesis of one-shot realistic head avatars. In
    European Conference on Computer Vision (ECCV), August 2020.
    [26] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky.
    Few-shot adversarial learning of realistic neural talking head models. In ICCV,
    2019.
    [27] Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong
    Liu, Yu Ding, and Changjie Fan. Freenet: Multi-identity face reenactment. In
    CVPR, 2020.
    [28] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang.
    The unreasonable effectiveness of deep features as a perceptual metric. In
    Proceedings of the IEEE conference on computer vision and pattern recognition,
    pages 586–595, 2018.
    [29] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face
    alignment across large poses: A 3d solution. In Proceedings of the IEEE
    conference on computer vision and pattern recognition, 2016.
    [30] Yanbo Xu, et al. TransEditor: Transformer-based dual-space GAN for highly
    controllable facial editing. In Proceedings of the IEEE/CVF Conference on
    Computer Vision and Pattern Recognition, 2022.
    [31] Xueqi Hu, et al. Style transformer for image inversion and editing. In
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition, 2022.
    [32] Bowen Zhang, et al. StyleSwin: Transformer-based GAN for high-resolution
    image generation. In Proceedings of the IEEE/CVF Conference on Computer
    Vision and Pattern Recognition, 2022.
    [33] Ze Liu, et al. Swin transformer: Hierarchical vision transformer using shifted
    windows. In Proceedings of the IEEE/CVF International Conference on
    Computer Vision, 2021.
    [34] Kewei Yang, et al. Face2Face ρ: Real-time high-resolution one-shot face
    reenactment. In European Conference on Computer Vision. Springer Nature
    Switzerland, 2022.
    [35] Gee-Sern Hsu, Chun-Hung Tsai, and Hung-Yi Wu. Dual-generator face
    reenactment. In Proceedings of the IEEE/CVF Conference on Computer Vision
    and Pattern Recognition, 2022.
    [36] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset
    for recognising faces across pose and age. In 2018 13th IEEE International
    Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018.
    [37] Qiang Meng, et al. MagFace: A universal representation for face recognition
    and quality assessment. In Proceedings of the IEEE/CVF Conference on
    Computer Vision and Pattern Recognition, 2021.
    [38] Matthew Loper, et al. SMPL: A skinned multi-person linear model. ACM
    Transactions on Graphics (TOG), 2015.

    Full-text release date: 2024/08/10 (campus network, off-campus network, and the
    National Central Library Taiwan thesis and dissertation system).