| Field | Value |
|---|---|
| Student | Shao-Ting Cheng (鄭紹廷) |
| Thesis Title | A Self-Supervised Learning-Based Method for Detecting Deepfake Audio (基於自監督學習的語音深度偽造檢測方法) |
| Advisor | Shi-Jinn Horng (洪西進) |
| Committee Members | 林祝興, 楊竹星, 李正吉, 顏成安, 洪西進 (Shi-Jinn Horng) |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2023 |
| Academic Year of Graduation | 111 |
| Language | Chinese |
| Pages | 38 |
| Keywords (Chinese) | 音頻偽造辨識, 深度學習, 自監督學習, 音頻訊號分析 |
| Keywords (English) | Audio Deepfake Detection, Deep Learning, Self-Supervised Learning, Audio Signal Analysis |
| Views / Downloads | 279 / 0 |
In this age of information explosion, the development and application of audio deepfake technology have become increasingly widespread. However, the problems it brings have also become more apparent, including forged news reports, altered statements attributed to public figures, and even use in scams. The need to detect forged audio accurately and effectively is increasingly pressing in both academia and industry. Yet with the rapid development of voice synthesis and voice conversion technology, new forgery techniques are constantly emerging, posing severe challenges to the generalization capability of existing detection methods.
This paper proposes an audio deepfake detection method based on self-supervised learning that combines audio mixup with soft-label training and analyzes raw audio signals directly with a convolutional neural network (CNN). This approach eliminates the need to convert audio signals into spectrogram images, removing that conversion's computational cost. The method is trained on both real human speech and synthesized speech, and it demonstrates high accuracy and strong generalization on audio authenticity classification, achieving an average accuracy of 90.55% across multiple datasets. It surpasses current state-of-the-art detection methods while requiring significantly fewer parameters and less computation than those models.
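The audio mixup and soft-label training described above can be sketched as follows. This is a minimal illustration assuming the standard mixup formulation with a Beta-sampled coefficient; the function name, the `alpha` default, and the label layout are assumptions, since the thesis does not publish its implementation:

```python
import numpy as np

def audio_mixup(wave_a, wave_b, label_a, label_b, alpha=0.4, rng=None):
    """Blend two raw waveforms and their one-hot labels (mixup).

    A coefficient lam is drawn from Beta(alpha, alpha); the mixed clip is a
    weighted sum of the two signals, and the soft label keeps the same
    weights, e.g. [0.7 real, 0.3 fake] instead of a hard 0/1 target.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)                    # mixing coefficient in (0, 1)
    mixed_wave = lam * wave_a + (1 - lam) * wave_b  # sample-wise blend of raw audio
    soft_label = lam * label_a + (1 - lam) * label_b
    return mixed_wave, soft_label
```

Training the CNN against such soft labels discourages over-confident predictions on blended real/fake audio, which is one plausible route to the generalization the abstract reports.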
The main contribution of this method is its effective use of self-supervised learning to pre-train the network, mapping it into an audio representation space, followed by fine-tuning. To verify its efficacy, we compared it with several existing detection methods. The results show that even when a forgery technique is absent from the training set, our method maintains high accuracy and robustness. We hope this research provides new perspectives and tools for the field of audio deepfake detection, and offers researchers in related fields more references and options for deeper research and application development.
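The pretrain-then-fine-tune workflow described above can be illustrated with a deliberately tiny sketch: a frozen stand-in encoder (representing the self-supervised pretrained CNN, whose pretraining stage is not shown) plus a logistic head fitted on labeled real/fake clips. Every name and architectural choice here is an assumption for illustration only, not the thesis's actual network:

```python
import numpy as np

def encode(wave, kernel):
    """Stand-in for a pretrained CNN encoder: one valid-mode 1-D convolution
    followed by global average pooling. Purely illustrative."""
    return np.convolve(wave, kernel, mode="valid").mean(keepdims=True)

def fine_tune(waves, labels, kernel, epochs=200, lr=0.5):
    """Fit a logistic head on frozen encoder features (the fine-tuning stage);
    the self-supervised pretraining that would produce `kernel` is not shown."""
    feats = np.array([encode(wv, kernel) for wv in waves])  # shape (N, 1)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        z = feats[:, 0] * w + b
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid probability of "real"
        grad = p - labels                      # dL/dz for binary cross-entropy
        w -= lr * (grad * feats[:, 0]).mean()  # gradient step on the head only
        b -= lr * grad.mean()
    return w, b
```

In the real pipeline the encoder weights would come from the self-supervised pretext task and could be partially unfrozen during fine-tuning; here only the head is trained, which is the simplest variant of the same two-stage idea.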