
Author: 陳思杰
Szu-Chieh Chen
Thesis Title: 一個結合條件式生成對抗網路與條件式變分自動編碼器的人臉表情影像之生成系統
A Facial Expression Image Generation System Based on Conditional Generative Adversarial Network Combining with Conditional Variational Auto Encoder
Advisor: 范欽雄
Chin-Shyurng Fahn
Committee: 林啟芳
Chi-Fang Lin
陳彥霖
Yen-Lin Chen
吳怡樂
Yi-Leh Wu
范欽雄
Chin-Shyurng Fahn
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2022
Graduation Academic Year: 110
Language: English
Pages: 47
Keywords (in Chinese): 人臉表情影像生成、影像轉譯、條件式生成對抗網路、條件式變分自動編碼器、人臉表情影像形變、情緒程度、人臉影像變換
Keywords (in other languages): Facial expression image generation, Image translation, Conditional generative adversarial network, Conditional variational auto encoder, Facial expression image morphing, Emotional level, Face image change
  • 表達是人與人交流的第一扇門,如何解讀人類的表情,對於醫學或心理學都有很大的幫助。在現實世界中,尤其是人臉的數據更難獲取,因此,考慮到人臉識別的各種需求,我們將研究應用到二維人臉表情影像的生成上。目前對抗性生成網絡已經成熟,但在影像轉譯 (Image translation) 的領域還不夠完善;針對這些原因,我們提出了一種較為新穎的臉部表情影像生成方法。
    在研究期間,我們探討了條件式生成對抗網路(Conditional generative adversarial network; cGAN)與條件式變分自動編碼器(Conditional variational auto encoder; cVAE)的不同;基於cGAN的生成對抗網路,可以從資料集裡學習樣本的分佈,並且在模型的構造引入了條件變數(Condition variable),進而實現對真實資料的逼近,而cVAE有加入一些雜訊到自動編碼器,它是透過常態分佈的抽樣來產生結果,本質上,雖然cVAE生成的影像質量較cGAN差,但其實更具多樣化。在本論文中,我們為了獲得cVAE與cGAN的優點,進一步改進了cVAE-GAN的模型,並開發了更多新穎的技術,包括:如何提取U-Net模型的精髓,也比較現有文獻中,近幾年較先進的模型;在我們所修改後的cVAE-GAN模型裡,只需要輸入一張人臉影像,即可生成各種不同的面部表情,其中,使用分類器和對抗性損失函數,以及自動編碼器,這都有助於提高生成影像的多樣性和影像質量。在資料集的選擇上,我們採用了Fer-2013資料集,它們有上萬張人臉影像可供訓練;我們的生成模型的建立分為兩個階段:在訓練階段中,透過批量來訓練資料集,並對其進行資料前處理,而在推理階段,透過已訓練好的模型,來生成七種人臉表情影像,包含:憤怒、厭惡、恐懼、快樂、傷心、驚訝與無表情。
    此外,我們還將展示不同時期的訓練細節,其中,使用各種指標評估每個 GAN 生成的人臉表情的性能,例如:FID、IS、PSNR 和 SSIM,不僅如此,我們也會透過消融實驗,透過一些評估指標,來檢驗我們模型的參數,藉此印證是否有達到更好的效能,根據實驗結果,與原始模型相比,經過消融實驗的調整,我們的模型在七種面部表情的平均 FID 得分上,從97.60改進到了39.86,這是令人鼓舞的表現。值得一提地,我們的模型還可以實現更多的應用,比如:不同人臉表情影像之間的形變、同一人臉表情影像形變成不同的情緒程度,或不同人臉影像之間的變臉。最後,我們將這些應用所生成的人臉表情影像,與其它的模型,諸如:CDAAE、cVAE-GAN與cCycleGAN生成的人臉表情影像進行比較,實驗結果顯示我們的模型出乎意料地具有更好的性能。


    Facial expression is the first gateway to human-to-human communication, and interpreting it is of great help to medicine and psychology. In the real world, however, face data are especially difficult to obtain. Considering the various needs of face recognition, we therefore apply our research to the generation of two-dimensional facial expression images. Generative adversarial networks are now mature, but they remain imperfect in the field of image translation; for these reasons, we propose a relatively novel facial expression image generation method.
    During the research, we investigated the differences between the conditional generative adversarial network (cGAN) and the conditional variational autoencoder (cVAE). A cGAN learns the distribution of samples from the dataset and incorporates a condition variable into the model so as to approximate the real data. A cVAE adds noise to the autoencoder and produces results by sampling from a normal distribution. In essence, the cVAE generates more diverse images than the cGAN, although their quality is lower. To obtain the advantages of both, we further improve the cVAE-GAN model and develop additional techniques, including how to extract the essence of the U-Net model.
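    The sampling step attributed to the cVAE above can be illustrated with a minimal sketch. It assumes the usual reparameterization form z = mu + sigma * eps and a condition variable that is simply concatenated to the latent code; the function names are hypothetical and the abstract does not state that the thesis implements the step this way.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def conditioned_input(z, condition):
    """Concatenate the latent code with the condition variable (assumed scheme)."""
    return list(z) + list(condition)


if __name__ == "__main__":
    rng = random.Random(0)
    z = sample_latent([0.0, 1.0], [0.0, 0.0], rng)
    print(conditioned_input(z, [0.0, 1.0]))
    print(kl_to_standard_normal([0.0, 1.0], [0.0, 0.0]))
```

    Because the noise enters through eps, repeated calls with the same mean and variance yield different latent codes, which is the source of the diversity the abstract attributes to the cVAE.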
    We compare our modified cVAE-GAN model with state-of-the-art models from the recent literature. Our model needs only a single input face image to generate various facial expression images. To achieve this, we employ a classifier, an adversarial loss function, and an autoencoder, all of which help to increase the diversity and quality of the generated images. For the dataset, we use Fer-2013, which provides tens of thousands of face images for training. Building our generative model is divided into two phases: in the training phase, we preprocess the dataset and train on it in batches; in the inference phase, the trained model generates seven kinds of facial expression images: anger, disgust, fear, happiness, sadness, surprise, and neutral.
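    The inference phase described above, one input face and seven conditioned outputs, might be sketched as follows. The one-hot encoding, the index order (taken from the order in which the abstract lists the expressions), and the `generator(face, condition)` call signature are all assumptions made for illustration, not the thesis's actual interface.

```python
# Expression names in the order listed in the abstract; the index
# assignment is an assumption for this sketch.
EXPRESSIONS = ["anger", "disgust", "fear", "happiness",
               "sadness", "surprise", "neutral"]


def one_hot(expression):
    """Encode an expression name as a one-hot condition vector of length 7."""
    index = EXPRESSIONS.index(expression)
    return [1.0 if i == index else 0.0 for i in range(len(EXPRESSIONS))]


def generate_all(face, generator):
    """Run a trained generator once per expression on the same input face.

    `generator` is any callable taking (face, condition); it stands in
    for the trained model, whose real interface is not given in the source.
    """
    return {name: generator(face, one_hot(name)) for name in EXPRESSIONS}


if __name__ == "__main__":
    # Dummy generator that only reports which condition it received.
    dummy = lambda face, condition: (face, condition.index(1.0))
    for name, result in generate_all("input_face", dummy).items():
        print(name, result)
```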
    In addition, we show the training details at different epochs and evaluate the facial expressions generated by each GAN using various metrics, such as FID, IS, PSNR, and SSIM. We also examine the parameters of our model through ablation studies, using several evaluation metrics to verify whether better performance is achieved. According to the experimental results, after the adjustments made in the ablation studies, the average FID of our model over the seven kinds of facial expressions improves from 97.60 for the original model to 39.86, which is encouraging. Our model also supports further applications: morphing between different facial expression images, morphing the same facial expression image across different emotional levels, and changing between different face images of the same facial expression. Finally, we compare the facial expression images generated in these applications with those generated by other models, such as CDAAE, cVAE-GAN, and cCycleGAN. The experimental results reveal that our model, somewhat unexpectedly, performs better.
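    Two of the ideas in this paragraph can be shown in a few lines of Python. The first function computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), over flattened pixel lists. The second linearly interpolates between two vectors, which is one common way to produce morphing sequences or graded emotional levels; the abstract does not say that the thesis uses linear interpolation, so treat that part as an assumption.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(generated):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def interpolate(start, end, steps):
    """Return `steps` evenly spaced vectors from `start` to `end`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return path


if __name__ == "__main__":
    print(psnr([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
    for vector in interpolate([0.0, 0.0], [1.0, 2.0], 3):
        print(vector)
```

    Feeding each interpolated vector to a generator, either as a latent code or as a condition vector moving from neutral toward a target expression, would give a sequence of intermediate images; which of the two the thesis interpolates is not specified in this record.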

    Abstract in Chinese
    Abstract
    Acknowledgments in Chinese
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Overview
        1.2 Motivation
        1.3 System Description
        1.4 Thesis Organization
    Chapter 2 Related Work
        2.1 Exploration of Two Facial Expression Image Generation Methods
            2.1.1 The landmark-based approaches
            2.1.2 The pixel-by-pixel approaches
        2.2 In-depth Exploration of cGAN, cVAE and cVAE-GAN
            2.2.1 Conditional generative adversarial networks
            2.2.2 Conditional variational autoencoders
            2.2.3 Conditional variational autoencoders with generative adversarial networks
        2.3 Exploration of the Ensemble of Autoencoder
            2.3.1 Cross-domain adversarial autoencoder
            2.3.2 U-Net
    Chapter 3 Our Facial Expression Image Generation Model
        3.1 Data Pre-processing
        3.2 cGAN Mixed with cVAE
            3.2.1 Generator
            3.2.2 Discriminator and classifier
        3.3 The Facial Expression Image Objectives
            3.3.1 Loss function
            3.3.2 Model optimizer
    Chapter 4 Experimental Results and Discussion
        4.1 Dataset of Facial Expression Images
            4.1.1 The Fer-2013 dataset
            4.1.2 Training detail in the Fer-2013 dataset
        4.2 Training on Our Facial Expression Image Generation Model
            4.2.1 Seven kinds of facial expression images generation
            4.2.2 Different emotional level facial expression images generation
        4.3 Testing on Our Facial Expression Image Generation Model
            4.3.1 Comparison to other models using objective metrics
            4.3.2 Comparison to other models with visual appearance
        4.4 Ablation Studies
            4.4.1 Model parameter and layers concatenation adjusting
            4.4.2 Model performance
        4.5 Related Applications
            4.5.1 Different facial expression images morphing
            4.5.2 Same facial expression images morphing by different emotional levels
            4.5.3 Different face images change of the different facial expression
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References


    Full text public date 2027/07/21 (Intranet public)
    Full text public date 2032/07/21 (Internet public)
    Full text public date 2032/07/21 (National library)