
Author: Hsin-Ping Hsiung (熊心平)
Thesis Title: End-to-End 1D Convolutional Neural Network for Synthetic Speech Detection (使用端對端的一維卷積神經網路於合成語音檢測)
Advisor: Yi-Leh Wu (吳怡樂)
Committee Members: Zheng-Yuan Tang (唐政元), Jian-Zhong Chen (陳建中)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: English
Number of Pages: 28
Chinese Keywords: End-to-End, Synthetic Speech Detection, 1D Convolutional Neural Network
Foreign Keywords: End-to-End, Synthetic Speech Detection, ASV2019

Most existing synthetic speech detection methods that use deep neural network architectures require pre-processing of the input speech waveform, such as feature extraction, before the processed data are fed into a deep neural network for learning. Although extracting features and then training a powerful deep neural network yields good prediction results, feature extraction is very time-consuming and consumes considerable storage resources. This thesis therefore proposes an end-to-end 1D convolutional neural network architecture that eliminates the pre-processing of the input waveform: the raw waveform is fed directly into the 1D convolutional neural network to obtain the synthetic speech detection result. For the network architecture, we adopt a deep neural network with aggregated residual transformations. Experimental results show that the proposed end-to-end architecture handles the synthetic speech detection task effectively and outperforms most methods that rely on traditional data pre-processing and feature extraction before training.
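To make the end-to-end idea concrete, below is a minimal PyTorch sketch of such a pipeline: the raw waveform enters a stack of 1D convolutions directly, with no spectrogram or hand-crafted feature front end. The class name EndToEnd1DCNN and all layer sizes here are illustrative assumptions for this record, not the exact architecture reported in the thesis.

import torch
import torch.nn as nn

class EndToEnd1DCNN(nn.Module):
    """Illustrative sketch: raw waveform in, bona fide/spoof logits out."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # Input shape: (batch, 1, num_samples) -- the raw waveform itself.
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        x = self.features(waveform)
        return self.classifier(x.flatten(1))

# One second of 16 kHz audio, fed in with no feature extraction step.
logits = EndToEnd1DCNN()(torch.randn(8, 1, 16000))
print(logits.shape)  # torch.Size([8, 2])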


Most existing deep learning approaches to synthetic speech detection pair a feature extraction stage with a deep neural network (DNN) architecture. Even though pre-transforming the data, for example by feature extraction, can help the network train a better model, it costs extra time to transform the input waveform and also requires a large amount of storage space. We therefore propose an end-to-end 1D convolutional neural network architecture for the synthetic speech detection task. Our method feeds the original waveform directly into the end-to-end framework, which then outputs the synthetic speech detection result. Within the network we adopt aggregated residual transformations. In the experiments, we demonstrate that our proposed model achieves very good accuracy on the ASVspoof 2019 LA dataset and also outperforms many traditional DNNs that use feature extraction. We further test our model on the ASVspoof 2015 dataset as a cross-dataset evaluation, where it shows strong synthetic speech detection ability on a different dataset.
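The aggregated residual transformations mentioned above follow the ResNeXt design of Xie et al., in which many parallel branches can be realized with a single grouped convolution. The following is a hedged 1D sketch of such a block; the cardinality of 32 and the channel widths are illustrative assumptions, not the configuration the thesis reports.

import torch
import torch.nn as nn

class ResNeXtBlock1D(nn.Module):
    """1D ResNeXt-style block: the `groups` argument of the middle
    convolution realizes the aggregated parallel branches."""
    def __init__(self, channels: int = 64, cardinality: int = 32, width: int = 4):
        super().__init__()
        inner = cardinality * width  # 32 branches x 4 channels each = 128
        self.branches = nn.Sequential(
            nn.Conv1d(channels, inner, kernel_size=1, bias=False),
            nn.BatchNorm1d(inner),
            nn.ReLU(inplace=True),
            nn.Conv1d(inner, inner, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),  # aggregated transforms
            nn.BatchNorm1d(inner),
            nn.ReLU(inplace=True),
            nn.Conv1d(inner, channels, kernel_size=1, bias=False),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.branches(x))  # residual shortcut

block = ResNeXtBlock1D()
out = block(torch.randn(8, 64, 500))  # (batch, channels, time)
print(out.shape)  # torch.Size([8, 64, 500])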

Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Related Work
  2.1 Raw Waveform-based DNNs
  2.2 Inception Network
  2.3 Deep Residual Neural Networks
    2.3.1 ResNet
    2.3.2 ResNeXt
Chapter 3. Proposed Method
  3.1 Baseline Architecture
  3.2 ResNeXt Block
  3.3 Proposed Architecture
Chapter 4. Experiments
  4.1 Dataset
  4.2 Training Details
  4.3 Evaluation Metrics
  4.4 Performance of ResNeXt-style TSSDNet
  4.5 Performance on the ASVspoof 2015 Dataset
  4.6 The Impact of Branch Number
Chapter 5. Conclusions and Future Work
References

