
Student: 韓悅華 (Yue-Hua Han)
Thesis Title: 人臉部位指引基石模型適應以泛化深偽影片辨識能力 / Towards More General Video-based Deepfake Detection through Facial Feature Guided Adaptation for Foundation Model
Advisor: 花凱龍 (Kai-Lung Hua)
Committee Members: 陳駿丞 (Jun-Cheng Chen), 陳祝嵩 (Chu-Song Chen), 王鈺強 (Yu-Chiang Wang)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 112 (ROC calendar)
Language: English
Pages: 45
Chinese Keywords: deepfake video detection, deep learning, foundation model, lightweight (parameter-efficient) training
Foreign Keywords: Deepfake Detection, Deep Learning, Foundation Model, Parameter-Efficient Fine-Tuning
Abstract (Chinese):
    With advances in deep learning, the general public can now generate highly realistic images through online services, making it easy for bad actors to forge images that spread disinformation and pose potential harm to society. Although research on face forgery detection has grown rapidly in recent years, many detection methods still cannot cope with Deepfake videos produced by emerging synthesis techniques. Foundation models, with their rich prior knowledge, have recently been applied across a wide range of tasks and have markedly improved generalization on each of them; among these, CLIP has demonstrated strong zero-shot capability in image semantic segmentation and object classification. We therefore bring CLIP to the Deepfake detection task to address this generalization bottleneck. Inspired by recent work on parameter-efficient fine-tuning, we equip the CLIP model with video-learning capability through a side-network video adapter. In addition, we propose Facial Component Guidance, which steers the model toward learning facial features and thereby improves its ability to generalize when identifying forgery artifacts. In comprehensive cross-dataset experiments, the method shows excellent detection performance on unseen forgery samples and manipulation methods. When combined with the current state-of-the-art detector, the ensemble improves the average cross-dataset generalization score by about 2.1%, highlighting the method's effectiveness and complementarity.


Abstract (English):
    The emergence of deep learning has produced generative models capable of creating highly realistic synthetic images, which poses serious risks of misuse. In response, research on face forgery detection has surged. However, many existing detection methods still struggle with Deepfakes created by new synthesis techniques. To tackle this generalization problem, we introduce a novel Deepfake detection method that leverages the rich representations of pre-trained Foundation Models, in particular the CLIP model, known for its strong zero-shot capabilities on other tasks. Drawing on recent advances in parameter-efficient fine-tuning, we introduce a side-network-based video adapter that equips the frozen backbone for video-level tasks. Additionally, we use Facial Component Guidance (FCG) to keep the model focused on crucial facial regions, enhancing the robustness and generality of Deepfake detection.
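
    Concretely, the approach amounts to a frozen image encoder with a small trainable side branch, plus a guidance term over facial regions. The PyTorch sketch below is a minimal illustration under stated assumptions: the class names, dimensions, the GRU-based temporal mixer, and the mask-based form of the guidance are illustrative guesses, not the thesis's actual (embargoed) implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SideAdapter(nn.Module):
        """Lightweight trainable branch that runs alongside the frozen backbone."""
        def __init__(self, dim: int = 768, hidden: int = 128):
            super().__init__()
            self.down = nn.Linear(dim, hidden)                        # shrink
            self.temporal = nn.GRU(hidden, hidden, batch_first=True)  # mix frames
            self.up = nn.Linear(hidden, dim)                          # expand

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, frames, dim) from the frozen encoder
            h = F.relu(self.down(frame_feats))
            h, _ = self.temporal(h)          # temporal aggregation across frames
            return frame_feats + self.up(h)  # residual adaptation

    class DeepfakeDetector(nn.Module):
        def __init__(self, visual_encoder: nn.Module, dim: int = 768):
            super().__init__()
            # visual_encoder: any frozen encoder mapping (N, 3, H, W) -> (N, dim),
            # e.g. CLIP's visual tower
            self.backbone = visual_encoder
            for p in self.backbone.parameters():
                p.requires_grad = False      # backbone stays frozen (PEFT)
            self.adapter = SideAdapter(dim)
            self.head = nn.Linear(dim, 1)    # real/fake logit

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, frames, 3, H, W)
            b, t = frames.shape[:2]
            with torch.no_grad():
                feats = self.backbone(frames.flatten(0, 1))  # (b*t, dim)
            feats = self.adapter(feats.view(b, t, -1))
            return self.head(feats.mean(dim=1)).squeeze(-1)  # video-level score

    def facial_component_guidance(attn: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # One plausible form of FCG: encourage the model's patch-attention map
        # to place mass on every facial component.
        # attn:  (batch, patches) attention weights over image patches
        # masks: (batch, parts, patches) binary masks for facial parts (lips,
        #        eyes, nose, skin), e.g. derived from facial landmarks
        part_attn = (attn.unsqueeze(1) * masks).sum(-1) / masks.sum(-1).clamp(min=1)
        return -torch.log(part_attn.clamp(min=1e-6)).mean()

    Training would then minimize a standard real/fake classification loss plus a weighted guidance term; because only the adapter and head receive gradients, the fine-tuning stays parameter-efficient.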

    Our method shows outstanding performance in extensive cross-dataset experiments, particularly against unseen forgery samples. By combining our approach with the latest state-of-the-art method, we achieve an average performance improvement of 2.1% in cross-dataset evaluations, highlighting its effectiveness and complementarity.
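
    The record does not say how the two detectors' outputs are combined for this figure; a common baseline for such an ensemble, assumed here purely for illustration, is score-level fusion of the per-video probabilities:

    import torch

    def fuse_scores(p_ours: torch.Tensor, p_sota: torch.Tensor, w: float = 0.5) -> torch.Tensor:
        # p_ours, p_sota: (videos,) fake probabilities from the two detectors;
        # w = 0.5 reduces to plain averaging of the two scores
        return w * p_ours + (1.0 - w) * p_sota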

    Contents:
    論文摘要 (Chinese Abstract) I
    Abstract II
    Acknowledgement III
    Contents IV
    List of Figures VI
    List of Tables IX
    1 Introduction 1
    2 Related Work 4
        2.1 Foundation Model, Parameter-Efficient Fine-Tuning 4
        2.2 Synthetic Image Detection 5
        2.3 Deepfake Detection 6
    3 Proposed Method 9
        3.1 Attention Attribute Extraction 9
        3.2 Overall Structure 10
        3.3 Facial Component Guidance 13
    4 Experiments 15
        4.1 Implementation Details 15
        4.2 Datasets for Forged Faces 16
        4.3 Dataset Preprocessing 16
        4.4 Generalisation to Unseen Dataset 17
        4.5 Generalisation to Unseen Manipulations 18
        4.6 Robustness Evaluation Against Perturbations 19
        4.7 Ablation Study on Model Components 23
        4.8 Analysis on Hyper-Parameters 24
        4.9 Qualitative Results and Discussions 30
    5 Conclusions 35
    References 36

    Full Text Release Date: 2029/02/01 (campus network)
    Full Text Release Date: 2029/02/01 (off-campus network)
    Full Text Release Date: 2029/02/01 (National Central Library: Taiwan NDLTD system)