
Student: Tsung-Hao Hsieh (謝宗豪)
Thesis Title: Dual-branch Face Forgery Detection in Spatial and Frequency Domain Based on Convolutional ViT Structure (基於卷積視覺轉換器結構之空間及頻域雙分支人臉偽造檢測)
Advisor: Chang-Hong Lin (林昌鴻)
Committee Members: Chang-Hong Lin (林昌鴻), Chung-An Shen (沈中安), Yung-Yao Chen (陳永耀)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar, 2022-2023)
Language: English
Pages: 68
Chinese Keywords: 人臉偽造偵測、卷積神經網路、視覺轉換器、深度學習
English Keywords: Face Forgery Detection, Deep Learning, Convolutional Neural Network, Vision Transformer
    With advances in generative deep learning techniques, people can now produce realistic forged images, which are widely used in animation, medicine, image restoration, and other fields. However, these forged images can be used to maliciously alter video content or impersonate well-known figures, and they can be difficult for the human eye to identify, so we must study how to correctly determine whether a video or image has been forged. Face forgery detection techniques based on deep learning have therefore been proposed, using neural networks to detect forgery traces that are hard for humans to recognize.
    This thesis aims to design a network architecture with an enhanced ability to detect forgeries in low-quality images. Although previous detection methods have achieved good accuracy on high-quality images, there is still room for improvement on low-quality images, because forgery traces in low-quality images may be lost or corrupted by noise, making them difficult for a model to detect. This thesis therefore proposes a novel dual-branch model to address this problem. We use frequency-domain information to compensate for forgery traces lost in the spatial domain due to low image quality. One branch extracts features in the spatial domain, while the other extracts features in the frequency domain. Both branches use a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) to extract features; by combining these two architectures, we can effectively capture subtler forgery flaws. After the two branches learn their respective information, we use a cross-branch mixing method to fuse the spatial- and frequency-domain information. Experimental results show that our model achieves better accuracy than previous methods on low-quality images, and it also shows a clear improvement on high-quality images.


    With the advancement of generative deep learning techniques, people can now generate realistic forged images, which are widely used in animation, medicine, image restoration, and other fields. However, these forged images can potentially be used for malicious purposes, such as altering video content or impersonating celebrities. Therefore, there is a need to study how to accurately determine whether a video or image has been forged. As a result, various techniques for detecting face forgery using deep learning have been proposed, utilizing neural networks to detect subtle forgery traces that are difficult for the human eye to identify.
    This thesis aims to design an architecture that enhances the ability to detect forgeries in low-quality images. Although previous detection methods have shown good performance on high-quality images, there is still room to improve detection on low-quality images, because forgery artifacts in low-quality images may be lost or contaminated by noise, making them difficult for the model to detect. To address this issue, this thesis proposes a novel dual-branch model that utilizes frequency-domain information to compensate for information lost in the spatial domain. One branch extracts features from the spatial domain, while the other branch extracts features from the frequency domain. Both branches apply a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) to extract features, allowing the model to capture finer-grained artifacts. After each branch learns its respective information, a cross-branch fusion module exchanges information between the two domains. Based on the experimental results, our proposed model outperforms previous methods in detecting low-quality images, and it also demonstrates a significant improvement on high-quality images.
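
    The architecture summarized above can be made concrete with a small sketch. Below is a minimal, illustrative PyTorch implementation of the dual-branch idea, assuming a 2-D DCT as the frequency-domain view, a toy CNN stem in place of EfficientNet-B3, a small Transformer encoder in place of the full ViT, and a single cross-attention step for the fusion; all layer sizes and the fusion details are assumptions, not the thesis's exact design.

    # A minimal sketch of the dual-branch detector described in the abstract.
    # Assumptions: a 2-D DCT gives the frequency view, a toy CNN stem stands
    # in for EfficientNet-B3, nn.TransformerEncoder stands in for the ViT,
    # and one cross-attention step fuses the branches. None of the sizes
    # below are the thesis's actual settings.
    import math
    import torch
    import torch.nn as nn

    def dct2(x: torch.Tensor) -> torch.Tensor:
        """2-D DCT-II over the last two dims (square inputs), via matmul.
        In practice the DCT output is usually log-scaled or normalized."""
        n = x.shape[-1]
        k = torch.arange(n, dtype=x.dtype, device=x.device)
        basis = torch.cos(math.pi * (k[None, :] + 0.5) * k[:, None] / n)  # (n, n)
        return basis @ x @ basis.T

    class BranchEncoder(nn.Module):
        """CNN stem followed by a small Transformer encoder (CNN + ViT branch)."""
        def __init__(self, dim: int = 128, token_side: int = 14):
            super().__init__()
            self.cnn = nn.Sequential(                     # stand-in for EfficientNet-B3
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.SiLU(),
                nn.AdaptiveAvgPool2d(token_side),
            )
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for ViT

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            feat = self.cnn(x)                            # (B, dim, s, s) feature map
            tokens = feat.flatten(2).transpose(1, 2)      # (B, s*s, dim) patch tokens
            return self.encoder(tokens)

    class DualBranchDetector(nn.Module):
        """Spatial branch + frequency branch with cross-attention fusion."""
        def __init__(self, dim: int = 128):
            super().__init__()
            self.spatial = BranchEncoder(dim)
            self.frequency = BranchEncoder(dim)
            self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.head = nn.Linear(2 * dim, 1)             # single real/fake logit

        def forward(self, img: torch.Tensor) -> torch.Tensor:
            s = self.spatial(img)                         # spatial-domain tokens
            f = self.frequency(dct2(img))                 # frequency-domain tokens
            mixed, _ = self.cross(s, f, f)                # spatial attends to frequency
            pooled = torch.cat([mixed.mean(1), f.mean(1)], dim=-1)
            return self.head(pooled).squeeze(-1)          # use with BCEWithLogitsLoss

    model = DualBranchDetector()
    logits = model(torch.randn(2, 3, 224, 224))           # two 224x224 face crops
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.tensor([1.0, 0.0]))                 # e.g., 1 = fake, 0 = real

    Training with binary cross-entropy on a single logit matches the "Binary Cross Entropy Loss" entry in the table of contents below; everything else in the sketch is illustrative.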

    Abstract (Chinese) I
    ABSTRACT II
    Acknowledgements III
    LIST OF CONTENTS IV
    LIST OF FIGURES VI
    LIST OF TABLES VIII
    CHAPTER 1 INTRODUCTION 1
      1.1 Motivation 1
      1.2 Contributions 3
      1.3 Thesis Organization 4
    CHAPTER 2 RELATED WORKS 5
      2.1 Spatial-based Face Forgery Detection 5
      2.2 Frequency-based Face Forgery Detection 6
      2.3 Vision Transformer 7
    CHAPTER 3 PROPOSED METHOD 8
      3.1 Data Preprocessing 10
        3.1.1 Face Extraction 11
        3.1.2 Random Horizontal Flip 13
        3.1.3 Random Rotation 13
        3.1.4 Random Erasing 15
        3.1.5 Random Brightness 17
      3.2 Frequency Operations 18
      3.3 Convolutional Neural Network Architecture 24
        3.3.1 EfficientNet-B3 25
        3.3.2 MBConvolution Block 28
        3.3.3 Squeeze-and-Excitation Block 32
      3.4 Vision Transformer Architecture 33
        3.4.1 Transformer Encoder 36
        3.4.2 Cross-attention Module 38
      3.5 Loss Function 40
        3.5.1 Binary Cross Entropy Loss 40
    CHAPTER 4 EXPERIMENTAL RESULTS 41
      4.1 Experimental Environment 41
      4.2 Training Details 42
      4.3 Dataset 43
      4.4 Evaluation Metrics 45
      4.5 Experimental Results 47
        4.5.1 Comparisons with Previous Methods 47
        4.5.2 Ablation Studies 49
    CHAPTER 5 CONCLUSIONS AND FUTURE WORKS 51
      5.1 Conclusions 51
      5.2 Future Works 52
    REFERENCES 54
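
    The preprocessing steps listed in Section 3.1 (random horizontal flip, rotation, erasing, and brightness, applied after face extraction) correspond to standard augmentations. A minimal torchvision sketch follows, with assumed probabilities, angles, and brightness range rather than the thesis's actual settings:

    # Illustrative augmentation pipeline; all parameter values are assumed.
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),            # after face extraction/cropping
        transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
        transforms.RandomRotation(degrees=10),    # random rotation
        transforms.ColorJitter(brightness=0.2),   # random brightness
        transforms.ToTensor(),
        transforms.RandomErasing(p=0.5),          # random erasing (operates on tensors)
    ])

    Note that RandomErasing operates on tensors, so it must come after ToTensor in the composed pipeline.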


    Full text available from 2025/08/01 (campus network)
    Full text available from 2025/08/01 (off-campus network)
    Full text available from 2025/08/01 (National Central Library: 臺灣博碩士論文系統)