
Researcher: 胡志文 (Zhi-Wen Hu)
Thesis title: A framework of automatic Hokkien speech synthesis and transfer learning system (自動化閩南語語音合成及遷移學習系統框架)
Advisor: 戴文凱 (Wen-Kai Tai)
Oral defense committee: 戴文凱 (Wen-Kai Tai), 賴佑吉, 魏德樂
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2024
Graduating academic year: 112
Language: English
Number of pages: 33
Chinese keywords: 閩南語, 深度學習, 語音合成, 自動語音辨識, 遷移學習, 自監督學習
Foreign keywords: Hokkien, Deep Learning, Speech Synthesis, Automatic Speech Recognition, Transfer Learning, Self-Supervised Learning
Access counts: views: 86; downloads: 0

In recent years, deep learning has achieved great success across many scientific fields. For text-to-speech (TTS) technology, deep learning has raised the overall performance and efficiency of such systems. From two-stage models to end-to-end models, both the quality and the cost of speech synthesis models have improved markedly over the past few years. However, like other deep learning models, the performance of a speech synthesis model depends heavily on the quality of its training data. Training a high-quality speech synthesis model requires a large amount of paired text and speech data, which in most cases is difficult to collect and produce, let alone for rarer languages that are often poorly preserved or may not even have a common writing system, such as Hokkien.

This thesis aims to develop a self-supervised learning framework for a Hokkien speech synthesis system and to provide a more efficient production workflow for Hokkien speech synthesis applications. The framework takes raw audio collected directly from video-sharing websites and builds a speech dataset through audio processing steps such as automatic speaker diarization, speech denoising, and effects units; it then builds the corresponding text dataset with automatic speech recognition (ASR) and an automated Hokkien data cleaning pipeline, and uses transfer learning to train a high-quality speech synthesis model simply and quickly. The framework also provides different input effects to control the synthesis output and produce more vivid speech.

The final experiments show that the base model of the proposed framework supports effective transfer learning for Hokkien, and that a model fine-tuned on the Hokkien dataset produced by the automated data processing pipeline can quickly match or even surpass a model trained on a manually labeled dataset. The synthesis control effects provided by the framework also effectively improve the quality of the synthesized speech.


Deep learning has achieved great success in numerous scientific fields in recent years, and it has substantially improved the overall performance and efficiency of building a text-to-speech (TTS) system. From two-stage models to end-to-end models, both the quality and the cost of TTS models have improved considerably over the last few years. However, just like all other deep learning models, a TTS model is heavily shaped by the data used for training. Training a good TTS model requires a large amount of high-quality text-audio pair data. In most cases, such data are very difficult to collect and produce, not to mention for rare languages that are generally not well preserved or may not even have a common writing system, like Hokkien.

In this thesis, we aim to develop a self-supervised learning framework for a Hokkien TTS system and to provide a productive workflow for Hokkien speech synthesis applications. The proposed framework takes raw audio data collected directly from internet media and builds a Hokkien audio dataset through audio processing steps such as speaker diarization, audio denoising, and effects units. By applying an automatic speech recognition (ASR) system and an automated Hokkien data cleaning process, we create the corresponding Hokkien text dataset and can easily train a high-quality TTS model with it. In addition, the proposed framework provides different input tags at inference time to improve the controllability of speech synthesis and produce more vivid speech.
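As a concrete illustration of the data pre-processing stage, the sketch below strings denoising, speaker diarization, and ASR together into a minimal audio-to-dataset pipeline. It assumes pyannote.audio, noisereduce, and Whisper as stand-ins; the file paths, model names, and parameters are illustrative and not the exact configuration used in the thesis.

    # A minimal sketch of the automated audio-to-dataset pipeline, assuming
    # pyannote.audio for speaker diarization, noisereduce for denoising, and
    # Whisper as a stand-in ASR model. Paths and model names are placeholders.
    import librosa
    import noisereduce as nr
    import soundfile as sf
    import whisper
    from pyannote.audio import Pipeline

    # Raw audio collected from internet media (path is a placeholder).
    audio, sr = librosa.load("raw_episode.wav", sr=16000)

    # 1) Denoise the recording before segmentation.
    audio = nr.reduce_noise(y=audio, sr=sr)
    sf.write("denoised.wav", audio, sr)

    # 2) Speaker diarization: find "who spoke when" so every clip contains a
    #    single speaker (a Hugging Face access token may be required).
    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization")
    diarization = diarizer("denoised.wav")

    # 3) Cut per-speaker segments and transcribe each one with ASR; the thesis
    #    additionally applies a Hokkien-specific text cleaning step afterwards.
    asr = whisper.load_model("medium")
    dataset = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        clip = audio[int(turn.start * sr):int(turn.end * sr)]
        sf.write("clip.wav", clip, sr)
        text = asr.transcribe("clip.wav")["text"]
        dataset.append({"speaker": speaker, "audio": clip, "text": text})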

Based on our experimental results, the base model of the proposed framework is shown to be effective for Hokkien TTS transfer learning. A model fine-tuned with the Hokkien dataset generated automatically by the proposed framework is also shown to reach the performance of a model fine-tuned with a human-labeled Hokkien dataset. The input tags for controlling the synthesis results at inference time likewise prove effective for improving synthesis quality.
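The transfer-learning step referred to above follows the usual weight-initialization pattern: start the Hokkien model from a pretrained base checkpoint, keep every weight whose shape still matches, and fine-tune on the new data with a small learning rate. The following is a self-contained PyTorch sketch of that idiom using a toy stand-in model; it is not the thesis's actual VITS2 architecture or training script.

    # Transfer-learning idiom with a toy stand-in model (not the VITS2 code).
    import torch
    import torch.nn as nn

    class TinyTTS(nn.Module):
        """Toy TTS stand-in: token embedding + GRU encoder + mel-frame decoder."""
        def __init__(self, vocab_size: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, 64)
            self.encoder = nn.GRU(64, 128, batch_first=True)
            self.decoder = nn.Linear(128, 80)  # predicts 80-dim mel frames

        def forward(self, tokens):
            hidden, _ = self.encoder(self.embed(tokens))
            return self.decoder(hidden)

    # "Pretrained" base model (stands in for the framework's base checkpoint).
    base = TinyTTS(vocab_size=100)
    torch.save(base.state_dict(), "base_model.pt")

    # The Hokkien model may use a different token inventory, so some layers
    # change shape; copy only the weights whose shapes still match.
    hokkien = TinyTTS(vocab_size=120)
    base_state = torch.load("base_model.pt")
    own_state = hokkien.state_dict()
    matched = {k: v for k, v in base_state.items()
               if k in own_state and v.shape == own_state[k].shape}
    hokkien.load_state_dict(matched, strict=False)

    # Fine-tune on (here random) Hokkien data with a small learning rate.
    optimizer = torch.optim.AdamW(hokkien.parameters(), lr=1e-4)
    tokens = torch.randint(0, 120, (8, 50))   # fake token batch
    target = torch.randn(8, 50, 80)           # fake mel targets
    for step in range(100):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(hokkien(tokens), target)
        loss.backward()
        optimizer.step()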

Recommendation Letter i
Approval Letter ii
Abstract in Chinese iii
Abstract in English iv
Acknowledgements v
Contents vi
List of Figures viii
List of Tables ix
1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Goals 2
1.3 Overview of Our Method 2
1.3.1 Data Pre-Processing 2
1.3.2 Model Training 2
1.3.3 Controllability 3
1.4 Contributions 3
2 Related Work 5
2.1 Text-to-Speech System 5
2.2 Automated Speech Recognition 9
3 Proposed Method 11
3.1 Raw Data Processing 12
3.1.1 Speech Separation 12
3.1.2 Speaker Diarization 12
3.1.3 Audio Data Adjustment 13
3.1.4 ASR Model 15
3.1.5 Data Cleaning 16
3.2 Base Model 19
3.2.1 Dataset 19
3.2.2 VITS2 20
3.3 Speech Synthesis 20
4 Experiment 24
4.1 Effects Units Settings 24
4.2 Fine Tune Dataset 24
4.3 Training Detail 25
4.4 MOS Evaluation 25
5 Conclusions 30
5.1 Future Work 30
References 31


Full-text release date: 2028/08/21 (off-campus network)
Full-text release date: 2028/08/21 (National Central Library: Taiwan Dissertations and Theses System)