Author: 曾鈺惠 Yu-Hui Tseng
Thesis Title: 跨語言、多語者之語音合成技術及自動評估方法研究 (Cross-lingual Multi-speaker Text-To-Speech Synthesis with Automatic Evaluation)
Advisor: 林伯慎 Bor-Shen Lin
Committee: 楊傳凱 Chuan-Kai Yang, 賴源正 Yuan-Cheng Lai
Degree: 碩士 Master
Department: Department of Information Management, School of Management (管理學院 資訊管理系)
Thesis Publication Year: 2023
Graduation Academic Year: 111
Language: Chinese (中文)
Pages: 60
Keywords (in Chinese): 跨語言合成、多語者文字轉語音、國台語語音合成、自動化評估
Keywords (in other languages): Cross-lingual Synthesis, Multi-speaker Text-To-Speech, Speech Synthesis for Mandarin and Taiwanese, Automatic Evaluation
Text-to-speech can be applied in many fields, such as audiobooks, language teaching, and online spokespersons. In this study, we aim to build a cross-lingual, multi-speaker text-to-speech system that can synthesize cross-lingual utterances with an accent resembling that of speakers who speak only a single language. The synthesis model we use is a Transformer-based multi-speaker text-to-speech model. First, the tones and phonemes in the Mandarin and Taiwanese corpora are converted to a unified tone set and the International Phonetic Alphabet, and B, I, E, S tags indicating the relative position within a word are added. Next, we compute speaker embeddings with a pretrained x-vector model and train monolingual and cross-lingual synthesis models. The monolingual models are trained on the Mandarin and Taiwanese corpora separately and can synthesize utterances in a single language; the cross-lingual model is trained on the Mandarin and Taiwanese corpora mixed together and can synthesize utterances in both languages, including code-switching between them. We use three automatic methods to evaluate the quality of the synthesized speech, namely mel-cepstral distortion, the character error rate of speech recognition, and the accuracy of speaker identification, and experimentally compare the speech synthesized by the monolingual and cross-lingual models. We find that the mel-cepstral distortion of both models is below 8, with the monolingual model lower than the cross-lingual model, indicating better sound quality. For speech recognition, the jointly trained cross-lingual model achieves recognition performance close to that of monolingual training. Although the Mandarin speakers have no Taiwanese training data, the Taiwanese utterances synthesized for these speakers remain intelligible, with a character error rate even lower than that of utterances spoken by the original Taiwanese speakers. For speaker identification, we find that the accuracy for Taiwanese is much higher than for Mandarin, possibly due to the difference between the languages and the larger number of Mandarin speakers. Because the cross-lingual utterances generated by the cross-lingual model lack this language difference, their speaker identification accuracy is lower than that of the monolingual model, but part of the speaker's accent similarity is still retained.
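As a concrete illustration of the word-level position tagging described above, the following is a minimal sketch. It assumes each word is given as a list of (IPA phoneme, unified tone) pairs and that a B/I/E/S tag is assigned per phoneme; the function name `add_bies_tags` and the input format are illustrative and not taken from the thesis.

```python
# Minimal sketch (assumed input format): attach B/I/E/S word-position tags
# to a phoneme sequence. Names and data layout are illustrative only.

def add_bies_tags(words):
    """words: list of words, each word a list of (IPA phoneme, unified tone) pairs.
    Returns a flat list of (phoneme, tone, position tag) triples."""
    tagged = []
    for word in words:
        n = len(word)
        for i, (phoneme, tone) in enumerate(word):
            if n == 1:
                tag = "S"          # single-phoneme word
            elif i == 0:
                tag = "B"          # beginning of word
            elif i == n - 1:
                tag = "E"          # end of word
            else:
                tag = "I"          # inside the word
            tagged.append((phoneme, tone, tag))
    return tagged


if __name__ == "__main__":
    # Toy example: a two-syllable word followed by a one-syllable word.
    sentence = [[("ni", "3"), ("xau", "3")], [("ma", "5")]]
    print(add_bies_tags(sentence))
```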
Text-to-speech has been actively researched in recent years because it can be used in many fields, such as audiobooks, language tutoring, and virtual YouTubers. The aim of this research is to build a cross-lingual, multi-speaker text-to-speech system for Mandarin and Taiwanese that can synthesize Mandarin or Taiwanese utterances for all training speakers, including Taiwanese utterances in the accents of the Mandarin speakers and vice versa. The synthesis model is a Transformer-based text-to-speech model with speaker embeddings. First, the phonemes and tones in the Mandarin and Taiwanese databases are converted to the International Phonetic Alphabet (IPA) and a unified tone set, and B-I-E-S tags are added to indicate their relative positions at the word level. Next, speaker embeddings are computed with a pretrained x-vector model and used to train the monolingual and cross-lingual synthesis models. The monolingual models are trained on the Mandarin and Taiwanese databases separately and can synthesize monolingual utterances, while the cross-lingual model is trained on the two databases together and can synthesize utterances in either language as well as code-switching utterances. Three metrics are used to automatically evaluate the quality of the speech synthesized by the monolingual and cross-lingual models: mel-cepstral distortion, the character error rate of speech recognition, and the accuracy of speaker identification. Experimental results show that the mel-cepstral distortion of both models is below 8, with the monolingual model slightly lower, indicating better sound quality. In terms of speech recognition, the cross-lingual model achieves performance similar to that of the monolingual model. Although the Mandarin speakers provide no Taiwanese training utterances, the Taiwanese utterances synthesized for these speakers are still quite comprehensible, and their character error rate is even lower than that of the original Taiwanese speakers. In terms of speaker identification, the accuracy for synthesized Taiwanese utterances is much higher than that for Mandarin utterances, probably because of the language difference and the larger number of Mandarin speakers. When utterances are synthesized across languages, e.g., Taiwanese utterances for the Mandarin speakers, the accuracy is lower than for utterances synthesized by the monolingual model, but the speaker's accent is still partly retained.
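For the first of the three evaluation metrics, the sketch below computes mel-cepstral distortion between two mel-cepstral coefficient sequences using the standard formula MCD = (10 / ln 10) * sqrt(2 * Σ_d (c_ref,d − c_syn,d)²), averaged over frames. It assumes the coefficient matrices have already been extracted and time-aligned frame by frame (e.g., by dynamic time warping); the function name and the convention of dropping the 0th (energy) coefficient are assumptions, not details taken from the thesis.

```python
import numpy as np

# Minimal MCD sketch, assuming mel-cepstral coefficient matrices of shape
# (frames, order) that are already time-aligned. Skipping c0 is a common
# convention, assumed here rather than taken from the thesis.

def mel_cepstral_distortion(mc_ref, mc_syn, skip_c0=True):
    """Average mel-cepstral distortion in dB between two aligned sequences."""
    if skip_c0:
        mc_ref, mc_syn = mc_ref[:, 1:], mc_syn[:, 1:]
    diff = mc_ref - mc_syn
    # (10 / ln 10) * sqrt(2 * sum_d (c_ref,d - c_syn,d)^2), then mean over frames
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(200, 25))                    # reference mel-cepstra
    syn = ref + rng.normal(scale=0.1, size=ref.shape)   # slightly perturbed "synthesis"
    print(f"MCD: {mel_cepstral_distortion(ref, syn):.2f} dB")
```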