Author: 陳忠緯 Chung-wei Chen
Thesis Title: 用於英語語音合成之基週軌跡產生方法 (A Pitch Contour Generation Method for English Speech Synthesis)
Advisor: 古鴻炎 Hung-Yan Gu
Committee: 洪維廷 Wei-Tyng Hong, 陳柏琳 Ber-Lin Chen, 林彥君 Yen-Chun Lin, 林伯慎 Bor-Shen Lin
Degree: 碩士 Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Thesis Publication Year: 2010
Graduation Academic Year: 98
Language: Chinese
Pages: 66
Keywords (in Chinese): 基週軌跡、聲調預測、語音信號合成
Keywords (in other languages): pitch contour, tone prediction, speech signal synthesis
This thesis studies a pitch contour generation method for English speech synthesis. The first phase of pitch contour generation is to predict the tone class of each syllable. We propose a two-tier algorithm to predict the tone classes of the syllables in a sentence: the first tier uses a dynamic-programming search to find the best sequence of tone-class states, and the second tier estimates the local tone probability of each syllable. Three local probability estimation methods are studied, namely a weighted estimation method, a PPM-based method, and an artificial neural network (ANN) based method. In the second phase, the predicted tones and other contextual information are fed into another ANN to generate a pitch contour for each syllable. Volume, duration, and pauses are then set with heuristic rules, and the speech signal is synthesized with the harmonic-plus-noise model (HNM). A preliminary English speech synthesis system has been built and used to conduct intra-system listening tests; these show that the higher the tone prediction accuracy, the better the naturalness of the synthesized speech. Inter-system listening tests were also conducted, and the results show that the naturalness of our system's synthesized speech is still considerably lower than that of the Festival HTS system.
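The first-phase tone prediction described above can be illustrated with a short sketch. The Python code below is a minimal, assumption-laden rendering rather than the thesis implementation: the tone inventory `TONES`, the `local_prob` estimator (standing in for the weighted, PPM, and ANN methods of the second tier), and the `transition_prob` function are hypothetical placeholders, while the search itself follows the familiar Viterbi-style dynamic programming of the first tier.

```python
import math

# Hypothetical tone classes and probability functions; the thesis' actual tone
# inventory and its weighted / PPM / ANN estimators are not reproduced here.
TONES = ["high", "mid", "low", "rising", "falling"]


def predict_tones(syllables, local_prob, transition_prob):
    """Viterbi-style dynamic programming over per-syllable tone states.

    local_prob(syllable, tone)       -> local tone probability (second tier)
    transition_prob(prev_tone, tone) -> tone-to-tone transition probability
    Returns the tone sequence with the highest total log-probability.
    """
    if not syllables:
        return []

    # Initialize with the first syllable's local probabilities only.
    score = {t: math.log(local_prob(syllables[0], t) + 1e-12) for t in TONES}
    backpointers = []

    # Extend the best partial tone sequences one syllable at a time.
    for syl in syllables[1:]:
        new_score, back = {}, {}
        for t in TONES:
            emit = math.log(local_prob(syl, t) + 1e-12)
            prev = max(
                TONES,
                key=lambda p: score[p] + math.log(transition_prob(p, t) + 1e-12),
            )
            new_score[t] = score[prev] + math.log(transition_prob(prev, t) + 1e-12) + emit
            back[t] = prev
        score = new_score
        backpointers.append(back)

    # Trace back from the best final tone to recover the whole sequence.
    best = max(TONES, key=lambda t: score[t])
    tones = [best]
    for back in reversed(backpointers):
        best = back[best]
        tones.append(best)
    return list(reversed(tones))


# Toy usage with uniform transitions and a dummy local estimator.
if __name__ == "__main__":
    uniform = lambda prev, tone: 1.0 / len(TONES)
    dummy_local = lambda syl, tone: 0.9 if tone == "falling" and syl.endswith(".") else 0.2
    print(predict_tones(["hel", "lo", "world."], dummy_local, uniform))
```

Log-probabilities are accumulated so that long sentences do not underflow, and a small constant guards against zero probability estimates; how the real system weights local versus transition scores is not specified here.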
[1] 周彥佐, "A Study of Mandarin and Min-Nan Speech Synthesis Based on HNM", Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, 2007.
[2] 梁弘學, "A Study of English Singing Voice Synthesis", Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, 2009.
[3] R. A. J. Clark, "Generating Synthetic Pitch Contours Using Prosodic Structure", PhD thesis, University of Edinburgh, Edinburgh, 2003.
[4] K. Silverman et al., "ToBI: A Standard for Labelling English Prosody", Proceedings of the 1992 International Conference on Spoken Language Processing, pp. 867-870, Banff, 1992.
[5] P. Taylor, "The Tilt Intonation Model", in Proc. ICSLP '98, Sydney, 1998.
[6] J. E. Cahn, "Generating Expression in Synthesized Speech", Technical Report, MIT Media Lab, Cambridge, 1990.
[7] J. B. Pierrehumbert, “The phonology and phonetics of English intonation”, Ph.D. Thesis, MIT, Cambridge, 1980.
[8] The Centre for Speech Technology Research, The Festival Speech Synthesis System. http://www.cstr.ed.ac.uk/projects/festival/
[9] HTS Working Group, HMM-based Speech Synthesis System (HTS). http://hts.sp.nitech.ac.jp/
[10] AT&T, Natural Voices. http://www.naturalreaders.com/
[11] NCH, Verbose Text to Speech Converter. http://www.nch.com.au/verbose/index.html
[12] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", CRC Press, United Kingdom, 1998.
[13] D. C. Montgomery and E. A. Peck, "Introduction to Linear Regression Analysis", Wiley, Hoboken, 2007.
[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Duration Modeling in HMM-based Speech Synthesis System", Proc. of ICSLP, Sydney, 1998.
[15] 曹亦岑, "Prosodic Parameter Generation for Mandarin Speech Synthesis Using Neural Networks with a Small Corpus", Master's thesis, Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, 1999.
[16] Carnegie Mellon University, The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/speech/
[17] K. Sjölander and J. Beskow, WaveSurfer, Centre for Speech Technology at KTH. http://www.speech.kth.se/wavesurfer/
[18] S. J. Lee, K. C. Kim, H. Y. Jung, and W. Cho, “Application of Fully Recurrent Neural Networks for Speech Recognition”, ICASSP, pp. 77-80, South Korea, 1991.
[19] K. Sayood, "Introduction to Data Compression, 3rd ed.", Morgan Kaufmann, San Francisco, 2005.
[20] 陳坤茂, "Operations Research, 3rd ed.", 華泰文化, Taipei, 2005.
[21] Wikipedia, "Multilayer perceptron". http://en.wikipedia.org/wiki/Multilayer_perceptron
[22] 葉怡成, "Applications and Implementation of Neural Network Models", 儒林圖書公司, Taipei, 2006.