Graduate student: 胡志文 (Zhi-Wen Hu)
Thesis title: 自動化閩南語語音合成及遷移學習系統框架
A framework of automatic Hokkien speech synthesis and transfer learning system
Advisor: 戴文凱 (Wen-Kai Tai)
Oral defense committee: 戴文凱 (Wen-Kai Tai)
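Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 33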
Keywords (Chinese): 閩南語, 深度學習, 語音合成, 自動語音辨識, 遷移學習, 自監督學習
Keywords (English): Hokkien, Deep Learning, Speech Synthesis, Automatic Speech Recognition, Transfer Learning, Self-Supervised Learning
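This thesis aims to develop a self-supervised learning framework for Hokkien speech synthesis systems and to provide a more efficient production workflow for Hokkien speech synthesis applications. The framework can directly use raw audio collected from video and audio websites and produce a speech dataset through audio processing methods such as automatic speaker segmentation and labeling, speech denoising, and audio effects units. It uses automatic speech recognition (ASR) and an automated Hokkien data cleaning process to produce a text dataset, and then applies transfer learning to train a high-quality speech synthesis model simply and quickly. The framework also provides different input effects to control the synthesis output, yielding more vivid synthesized speech.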


Deep learning has achieved great success in numerous scientific fields in recent years, and it has improved both the performance and the efficiency of building a TTS (text-to-speech) system. From two-stage models to end-to-end models, the quality and cost of TTS models have improved substantially over the last few years. But like all other deep learning models, a TTS model is heavily biased by its training data. Training a TTS model well requires a large amount of high-quality paired text-audio data. In most cases, collecting and producing such data is very difficult, and it is harder still for low-resource languages that are generally not well preserved or may not even have a common writing system, such as Hokkien.

In this thesis, we aim to develop a self-supervised learning framework for Hokkien TTS systems and to provide a productive method for Hokkien speech synthesis applications. The proposed framework can use raw audio collected directly from internet media to create a Hokkien audio dataset through audio processing steps such as speaker diarization, audio denoising, and effects units. By using an ASR (automatic speech recognition) system and an automatic Hokkien data cleaning process, we create the corresponding Hokkien text dataset and use it to train a high-quality TTS model easily. The proposed framework also provides different input tags at inference time to improve the controllability of speech synthesis and produce more vivid speech.
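
The data flow described above, from diarized audio segments through ASR to cleaned text-audio pairs, can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Segment` structure, the `build_dataset` function, the duration bounds, and the empty-transcript filter are hypothetical and are not taken from the thesis, whose actual cleaning rules and models are not stated in this abstract.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Segment:
    """One single-speaker slice of a raw recording (hypothetical structure)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: Optional[str] = None  # filled in by the ASR stage

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_dataset(
    segments: Iterable[Segment],
    transcribe: Callable[[Segment], str],
    min_seconds: float = 1.0,   # assumed bound, not from the thesis
    max_seconds: float = 15.0,  # assumed bound, not from the thesis
) -> List[Segment]:
    """Turn diarized segments into text-audio pairs.

    `segments` stands for the output of diarization, denoising, and effects
    processing; `transcribe` stands in for the ASR model. The two filters
    below are illustrative cleaning rules only.
    """
    pairs = []
    for seg in segments:
        # Drop segments that are too short or too long to train on.
        if not (min_seconds <= seg.duration <= max_seconds):
            continue
        text = transcribe(seg).strip()
        # Drop segments for which the ASR stand-in returned no text.
        if not text:
            continue
        pairs.append(replace(seg, text=text))
    return pairs
```

In this sketch the ASR model is passed in as a plain callable, so any recognizer can be substituted without changing the filtering logic; the resulting pairs would then serve as the fine-tuning set for a pretrained base model.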

Based on our experimental results, the base model of the proposed framework is shown to be effective for Hokkien TTS transfer learning. The model fine-tuned on the Hokkien dataset generated automatically by the proposed framework is also shown to reach the performance of the model fine-tuned on the human-labeled Hokkien dataset. The input tags that control the synthesis results at inference time are likewise shown to be effective in improving synthesis quality.

Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.2 Research Goals
1.3 Overview of Our Method
1.3.1 Data Pre-Processing
1.3.2 Model Training
1.3.3 Controllability
1.4 Contributions
2 Related Work
2.1 Text-to-Speech System
2.2 Automated Speech Recognition
3 Proposed Method
3.1 Raw Data Processing
3.1.1 Speech Separation
3.1.2 Speaker Diarization
3.1.3 Audio Data Adjustment
3.1.4 ASR Model
3.1.5 Data Cleaning
3.2 Base Model
3.2.1 Dataset
3.2.2 VITS2
3.3 Speech Synthesis
4 Experiment
4.1 Effects Units Settings
4.2 Fine Tune Dataset
4.3 Training Detail
4.4 MOS Evaluation
5 Conclusions
5.1 Future Work
References

