Author: Siao-He Wang (王小禾)
Thesis Title: Convolutional Neural Network Using Dual Acoustic Features Approaches to Audio Forgery Detection (卷積神經網路應用於雙重聲學特徵之偽造語音檢測)
Advisor: Jiann-Liang Chen (陳俊良)
Committee Members: Jiann-Liang Chen (陳俊良), Sy-Yen Kuo (郭斯彥), Ing-Yi Chen (陳英一), Chih-Lin Hu (胡誌麟)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science (電資學院)
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Number of Pages: 62
Keywords: Audio deepfake, Speech synthesis, VGG-16, Mel-frequency cepstral coefficients, Extended local ternary pattern
Access Counts: 215 views, 8 downloads


Abstract:
    Artificial intelligence (AI) technology has advanced steadily, and its applications in acoustics have attracted considerable attention. Among these applications, an audio deepfake is speech synthesized or modified with deep learning techniques, making it sound as if a specific person said things they never actually said. As synthesized speech grows increasingly realistic, ordinary listeners find it difficult to distinguish genuine human voices from computer-generated speech. The malicious use of audio deepfakes has severe implications for everyday life, including voice fraud and false testimony. Therefore, this study proposes a new framework for detecting audio deepfakes.
    The goal of audio deepfake detection is to identify traces of tampering or synthesis in speech recordings. Combining deep learning and signal processing techniques, this study proposes a dual-input model based on the VGG-16 architecture. The model uses Mel-frequency cepstral coefficients (MFCC) and extended local ternary patterns (ELTP) as dual features to capture dynamic traces of the vocal tract, enhancing the model's ability to extract speech characteristics. This study uses the Fake-or-Real dataset created by the APTLY lab, a benchmark dataset built with state-of-the-art Text-To-Speech (TTS) deep learning speech synthesizers, and draws on its three subsets: for-2-sec, for-norm, and for-original. In addition, a subset of GAN-generated deepfake speech from the ASVspoof 2019 dataset is used to augment the benchmark data.
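    As a rough illustration of the dual feature extraction described above, the sketch below computes MFCCs with librosa and a simplified 3x3 local ternary pattern over a log-mel spectrogram. The thesis does not publish its extraction code, so the library choice, sampling rate, coefficient count, and tolerance t are illustrative assumptions, and the basic LTP here stands in for the exact ELTP variant used in the thesis.

```python
# Sketch only: parameter values and the LTP formulation are assumptions,
# not the thesis's published configuration.
import numpy as np
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40):
    """Compute an MFCC matrix (n_mfcc x frames) for one utterance."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def extract_ltp_maps(path, sr=16000, t=5.0):
    """Treat the log-mel spectrogram as an image and encode each pixel's
    8-neighbourhood as upper/lower binary maps (the ternary code split)."""
    y, _ = librosa.load(path, sr=sr)
    img = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr)).astype(np.float32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    upper = np.zeros((h - 2, w - 2), dtype=np.uint8)
    lower = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        upper |= (neigh >= center + t).astype(np.uint8) << bit  # ternary +1
        lower |= (neigh <= center - t).astype(np.uint8) << bit  # ternary -1
    return upper, lower
```

    The resulting MFCC matrix and pattern maps can then be padded or resized to fixed shapes before being fed to the two input branches of the detector.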
    The proposed model achieves accuracies of 94.21%, 93.23%, and 97.64% on the mixed for-original, for-norm, and for-2-sec datasets, respectively. The experimental results show that the proposed method outperforms previous studies, providing higher audio deepfake detection accuracy.
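    For concreteness, a minimal sketch of the dual-input detector, assuming Keras, follows. The thesis names VGG-16 as the backbone, but the branch depths, filter counts, input shapes, merge strategy, and training settings shown here are simplified guesses rather than its actual configuration.

```python
# Sketch of a two-branch, VGG-style binary classifier; all shapes and
# hyperparameters are illustrative assumptions.
from tensorflow.keras import layers, models

def vgg_branch(input_shape, name):
    """A shortened VGG-style stack: paired 3x3 convolutions + max pooling."""
    inp = layers.Input(shape=input_shape, name=name)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    return inp, layers.GlobalAveragePooling2D()(x)

# One branch per acoustic feature; input shapes are assumed, not published.
mfcc_in, mfcc_feat = vgg_branch((40, 128, 1), "mfcc")   # MFCC "image"
eltp_in, eltp_feat = vgg_branch((128, 128, 1), "eltp")  # ELTP pattern map
merged = layers.concatenate([mfcc_feat, eltp_feat])
merged = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(merged)     # real vs. fake
model = models.Model(inputs=[mfcc_in, eltp_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

    Training would then call model.fit with paired (MFCC, ELTP) inputs and binary real/fake labels; the accuracies reported above come from the thesis's own configuration, not from this sketch.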

Table of Contents:
Abstract (Chinese) I
Abstract II
List of Figures VI
List of Tables VII
Chapter 1 Introduction 1
    1.1 Motivation 1
    1.2 Contributions 5
    1.3 Organization 6
Chapter 2 Related Work 7
    2.1 Audio Deepfake Categories 7
        2.1.1 Replay Attacks 7
        2.1.2 Speech Synthesis 8
        2.1.3 Voice Impersonation 8
    2.2 Audio Feature Extraction 9
        2.2.1 MFCC 9
        2.2.2 LFCC 11
        2.2.3 CQCC 12
    2.3 Detection Models 13
        2.3.1 VGG-16 13
        2.3.2 ResNet 13
Chapter 3 Proposed System 15
    3.1 System Architecture 15
    3.2 Audio Collection 16
        3.2.1 Fake-or-Real Dataset 16
        3.2.2 Asvspoof 2019 Dataset 19
        3.2.3 Dataset Merging 23
    3.3 Audio Preprocessing 24
        3.3.1 MFCC Feature 24
        3.3.2 ELTP Feature 26
    3.4 Detection Model Architecture 30
Chapter 4 Performance Analysis 32
    4.1 ELTP Analysis 32
    4.2 System Environment and Parameter Settings 33
    4.3 Performance Evaluation Metrics 35
    4.4 Performance Analysis 37
        4.4.1 For-original Analysis 37
        4.4.2 For-norm Analysis 39
        4.4.3 For-2seconds Analysis 40
    4.5 Performance Comparison 42
Chapter 5 Conclusions and Future Works 45
    5.1 Conclusions 45
    5.2 Future Works 46
References 48

