Graduate student: 郭靜宜 (Ching-Yi Kuo)
Thesis title: 基於深度學習之人臉法向量貼圖生成研究 (A Study on Facial Normal Map Generation Based on Deep Learning)
Advisor: 林宗翰 (Tzung-Han Lin)
Committee members: 歐立成 (Li-Chen Ou), 孫沛立 (Pei-Li Sun), 胡國瑞 (Kuo-Jui Hu)
Degree: Master
Department: Graduate Institute of Color and Illumination Technology, College of Applied Sciences
Year of publication: 2023
Academic year: 111
Language: Chinese
Pages: 72
Keywords (Chinese): 人臉法向量貼圖、人臉特徵點、卷積神經網路、U-net、立體光度法、光影重建
Keywords (English): Facial normal map, facial landmark, convolutional neural network (CNN), U-net, photometric stereo, image relighting
Owing to the complexity and diversity of human faces and the variability of photographic conditions, generating high-quality normal maps from one or more face images is a challenging task.
This study proposes a model architecture based on the U-net convolutional neural network. The user only needs to input a single front-facing, uniformly lit color face image, and the model generates a 512×512 facial normal map.
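A U-net takes an image in, downsamples it through an encoder, upsamples it back through a decoder, and passes skip connections between matching resolutions so fine facial detail survives. As a rough illustration only (the thesis's exact layer counts, channel widths, and losses are not specified here), a compact U-net-style network mapping an RGB face image to a per-pixel unit normal map could be sketched in PyTorch as follows; `TinyUNet` and its sizes are hypothetical names chosen for this sketch:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Two 3x3 convolutions with ReLU, the basic U-net building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Minimal U-net sketch: RGB image in, 3-channel normal map out (same size)."""
    def __init__(self, base=16):
        super().__init__()
        self.enc1 = conv_block(3, base)
        self.enc2 = conv_block(base, base * 2)
        self.bott = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)   # input: upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, 3, 1)            # x, y, z of the normal

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bott(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        out = self.head(d1)
        # Normalize so every pixel is a unit normal vector
        return out / out.norm(dim=1, keepdim=True).clamp_min(1e-8)
```

The skip connections (the `torch.cat` calls) are what distinguish a U-net from a plain encoder-decoder: they let the decoder reuse high-resolution features from the encoder when reconstructing fine geometry such as wrinkles and nostril edges.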
We employed a photometric-stereo-based photography system to collect the face images required for training. A total of 32 sets of training data and 6 sets of test data were captured with this system, and 2 additional sets of test data were captured with mobile phones. To increase data diversity, we used image relighting to augment the training data from 1 image to 16,128 images. We also segmented and numbered the face regions using facial landmarks and combined the region-numbering data with the images so that the model could learn the different parts of the face. Finally, the training data was divided into groups, from which ten separate models were trained.
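The capture and augmentation pipeline rests on two standard operations: photometric stereo recovers per-pixel normals from several images taken under known light directions, and relighting renders new images from those normals. A minimal sketch under the classical Lambertian assumption (intensity = albedo × max(0, n·l), with known unit light directions) is shown below; the function names and the least-squares formulation are illustrative, not the thesis's actual implementation:

```python
import numpy as np

def estimate_normals(images, lights):
    """Photometric stereo: recover per-pixel unit normals and albedo.

    images: (K, H, W) grayscale intensities under K known directional lights
    lights: (K, 3) unit light-direction vectors
    Assumes a Lambertian surface, so I = albedo * max(0, n . l).
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                         # (K, H*W)
    # Solve lights @ g = I in the least-squares sense, where g = albedo * n
    g, *_ = np.linalg.lstsq(lights, I, rcond=None)    # (3, H*W)
    albedo = np.linalg.norm(g, axis=0)                # (H*W,)
    n = g / np.maximum(albedo, 1e-8)                  # unit normals
    return n.T.reshape(H, W, 3), albedo.reshape(H, W)

def relight(normals, albedo, new_light):
    """Render the surface under a new directional light (Lambertian shading)."""
    shading = np.clip(normals @ new_light, 0.0, None)
    return albedo * shading
```

Calling `relight` with many different `new_light` directions is the kind of augmentation that turns one captured set into thousands of differently lit training images.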
In the experiments, we evaluated model performance with two metrics, cosine similarity and the structural similarity index (SSIM), comparing the ground-truth normal maps with those generated by the models. The results showed that, for test data captured with the photography system, the model trained with the data augmented to 2,304 images performed best, whereas for test data captured with mobile phones, the model trained with the data augmented to 16,128 images performed best. This demonstrates that our data augmentation method increases training-data diversity and improves model performance. Furthermore, for test data captured with the photography system, good results were obtained even without the region-numbering data, while for test data captured with mobile phones, the results were more stable when the region-numbering data was included.
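Both metrics are easy to state concretely. Cosine similarity treats each pixel of a normal map as a 3-vector and averages the dot products between corresponding unit vectors, so 1.0 means identical orientations. SSIM compares luminance, contrast, and structure; a full implementation uses a sliding window (e.g. `skimage.metrics.structural_similarity`), but a simplified single-window version over the whole image captures the formula. The sketch below is illustrative, not the thesis's evaluation code:

```python
import numpy as np

def mean_cosine_similarity(pred, gt):
    """Mean per-pixel cosine similarity between two normal maps of shape (H, W, 3)."""
    dot = np.sum(pred * gt, axis=-1)
    norm = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    return float(np.mean(dot / np.maximum(norm, 1e-8)))

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM with one global window (real SSIM slides a local window)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Identical maps score 1.0 on both metrics, and flipping every normal drives the cosine similarity to -1.0, which is why cosine similarity is the more direct measure of orientation error for normal maps.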
Finally, we observed that input face images with messy hair, or with hair occluding parts of the face, degrade the quality of the models' results. We therefore require input images to satisfy certain restrictions: the face must be photographed frontally and kept unobstructed, with the hair tied up or fixed behind the ears, and without glasses.