
Graduate Student: Shao-Ting Cheng (鄭紹廷)
Thesis Title: A Self-Supervised Learning-Based Method for Detecting Deepfake Audio (基於自監督學習的語音深度偽造檢測方法)
Advisor: Shi-Jinn Horng (洪西進)
Oral Defense Committee: 林祝興, 楊竹星, 李正吉, 顏成安, 洪西進 (Shi-Jinn Horng)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar; 2022-2023)
Language: Chinese
Pages: 38
Keywords: Audio Deepfake Detection, Deep Learning, Self-Supervised Learning, Audio Signal Analysis
Views: 279 / Downloads: 0


    In this age of information explosion, audio deepfake technology has seen increasingly widespread development and application. The problems it brings, however, have become just as apparent: forged reports, fabricated statements attributed to public figures, and even outright scams. The need to detect forged audio accurately and effectively is increasingly pressing in both academia and industry. Yet with the rapid progress of speech synthesis and voice conversion, new forgery techniques keep emerging, posing severe challenges to the generalization ability of existing detection methods.
    This paper proposes an audio deepfake detection method based on self-supervised learning. It combines audio mixup with soft-label training and analyzes raw audio signals directly with a convolutional neural network (CNN), eliminating the need to convert audio into spectrogram images and thus avoiding the computational cost of that conversion. Trained on both real human speech and synthesized speech, the method demonstrates high accuracy and strong generalization on audio authenticity classification: it achieves an average accuracy of 90.55% across multiple datasets, surpassing current state-of-the-art detection methods while requiring far fewer parameters and far less computation than today's most advanced models. A minimal sketch of the waveform-level mixup idea follows.
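    The sketch below is a hypothetical illustration of waveform-level mixup with soft labels, not the thesis's actual architecture: the layer sizes, the Beta parameter, and the names mixup_waveforms and TinyWaveformCNN are all assumptions made for the example. It blends two raw waveforms and their labels with one sampled weight and trains a small 1D CNN directly on the mixed signal, with no spectrogram step.

```python
# Hypothetical sketch of waveform mixup + soft-label training (PyTorch).
# All shapes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

def mixup_waveforms(x1, x2, y1, y2, alpha=0.4):
    """Blend two raw waveforms and their labels with a Beta-sampled weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2   # mixed waveform
    y = lam * y1 + (1 - lam) * y2   # soft label in [0, 1]
    return x, y

class TinyWaveformCNN(nn.Module):
    """Toy 1D CNN operating directly on raw waveforms."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, 1)            # single logit: real vs. fake

    def forward(self, x):                       # x: (batch, 1, samples)
        return self.head(self.features(x).squeeze(-1)).squeeze(-1)

# Usage: mix one real and one fake clip, train against the soft label.
real, fake = torch.randn(1, 1, 16000), torch.randn(1, 1, 16000)
x, y = mixup_waveforms(real, fake, torch.tensor(0.0), torch.tensor(1.0))
loss = nn.functional.binary_cross_entropy_with_logits(TinyWaveformCNN()(x), y.unsqueeze(0))
```

    Blending labels along with waveforms keeps the target consistent with the input: a clip that is 70% fake should be scored as such, which discourages overconfident hard decisions on borderline audio.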
    The main contribution of this method is its effective use of self-supervised learning to pre-train the network, mapping it into an audio representation space before fine-tuning. To verify its efficacy, we compared it with several existing detection methods; the results show that our method maintains high accuracy and robustness even when a forgery technique does not appear in the training set. We hope this research provides new perspectives and tools for audio deepfake detection and gives researchers in related fields more references and options for deeper study and application development.
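    The pre-train-then-fine-tune recipe can likewise be sketched. The example below assumes a generic masked-reconstruction objective standing in for the thesis's VQ-VAE-based masked autoencoder; the encoder, decoder, mask ratio, and function names are placeholders, not the actual model. The flow is: corrupt a waveform, learn to reconstruct it, then reuse the pre-trained encoder under a small classification head.

```python
# Hedged sketch of self-supervised pre-training followed by fine-tuning.
# Architecture and objective are simplified placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU())
decoder = nn.ConvTranspose1d(32, 1, kernel_size=64, stride=8)

def pretrain_step(wave, mask_ratio=0.5):
    """Self-supervised step: zero out random samples, reconstruct the original."""
    mask = (torch.rand_like(wave) > mask_ratio).float()
    recon = decoder(encoder(wave * mask))
    return nn.functional.mse_loss(recon, wave)

# Fine-tuning: keep the pre-trained encoder, attach a small classifier head.
classifier = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

def finetune_step(wave, label):
    logits = classifier(encoder(wave)).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, label)

wave = torch.randn(4, 1, 16000)
ssl_loss = pretrain_step(wave)                   # phase 1: no labels needed
clf_loss = finetune_step(wave, torch.zeros(4))   # phase 2: labeled real/fake data
```

    The key point of the two-phase split is that phase 1 consumes unlabeled audio only, so the encoder can learn a general audio representation from large corpora before the comparatively small labeled real/fake set is used in phase 2.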

    Table of Contents
    Acknowledgements
    Abstract (Chinese)
    ABSTRACT
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Thesis Organization
    Chapter 2: Related Work
      2.1 Audio Deepfake Methods
      2.2 Audio Deepfake Detection Methods
      2.3 Masked Autoencoders
      2.4 Self-Supervised Learning
      2.5 Audio Codecs
      2.6 Audio Deepfake Datasets
    Chapter 3: Methodology
      3.1 Masked Autoencoder
        3.1.1 Pre-training Dataset
        3.1.2 Vector-Quantized Variational Autoencoders
        3.1.3 Waveform Masking
      3.2 Audio Deepfake Detection Model Training
      3.3 Audio Mixup
    Chapter 4: Results
      4.1 Hardware and Software Specifications
      4.2 Experimental Results
      4.3 Performance Tests
      4.4 Generalization Tests
      4.5 Lightweight Model Tests
      4.6 Ablation Study
    Chapter 5: Conclusions and Future Work
    References


    Full-Text Release Date: 2053/07/31 (off-campus network)
    Full-Text Release Date: 2053/07/31 (National Central Library: Taiwan theses and dissertations system)