Author: 蔡承霖 Cheng-Lin Tsai
Title: 整合音色變換之國語語音合成系統 (A Timbre-Conversion Integrated Mandarin Speech Synthesis System)
Advisor: 古鴻炎 Hung-yan Gu
Committee members: 王新民 Hsin-Min Wang, 余明興 Ming-Shing Yu, 馮輝文 Huei-Wen Ferng
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2009
Academic year of graduation: 98 (ROC calendar)
Language: Chinese
Pages: 77
Chinese keyword: 音色變換 (timbre conversion)
Foreign-language keyword: voice conversion
This thesis proposes a method for Mandarin speech synthesis with integrated timbre conversion, and builds a system accordingly; given an input sentence, the system performs timbre conversion and synthesis to produce synthetic speech in another, target speaker's voice. We group syllable utterances by their initial and final classes, and train a syllable HMM for each group. We then propose HMM decoding and segmentation for time normalization, and apply principal component analysis to the mean vectors over the 8 segments of each syllable utterance. Taking a syllable's principal component coefficients and its contextual parameters as input, an ANN mapping mechanism converts the source speaker's initial and final HMMs of a syllable into initial and final HMMs carrying the target speaker's timbre; the initial/final HMM combination method studied here then merges them into one syllable HMM. Afterwards, the interpolation method studied here generates the DCC coefficients of each frame, and the spectral envelope computed from the DCC, together with pitch-adjustment parameters, controls HNM signal synthesis. Listening tests on the number of principal components show that only a few principal components are needed to obtain good speech signal quality; in addition, STC listening tests show that the vast majority of listeners judged the converted timbre to be close to the target speaker's, which verifies the effectiveness of the proposed timbre-conversion processing architecture.
In this thesis, we propose a Mandarin speech synthesis method with integrated timbre conversion, and build a system based on it. Given an input sentence, the system performs timbre conversion and synthesizes speech whose timbre resembles that of a target speaker. We classify the segmented syllable utterances according to their syllable initials and finals, and train a syllable HMM for each initial and final class. Next, we propose a time-normalization method based on HMM decoding and state segmentation: each syllable utterance is represented by 8 mean DCC (discrete cepstrum coefficient) vectors computed from its 8 decoded segments. For each initial or final class, the member syllables' mean DCC vectors are then collected to perform PCA (principal component analysis). Taking a syllable's PCA coefficients and contextual data as input, two mapping ANNs are used to convert its initial HMM and final HMM, respectively, into HMMs possessing the timbre characteristics of the target speaker. The two converted HMMs are then combined into a new syllable HMM using a method studied here. Afterwards, DCC vectors are generated for each frame by an interpolation method also studied here, and the spectral envelope computed from each frame's DCC, together with a pitch contour generated by an ANN, is used to control an HNM (harmonic-plus-noise model) to synthesize the speech signal. In listening tests with different numbers of PCA dimensions, synthetic speech quality was acceptable when only a few PCA coefficients were used. Also, in STC (source, target, converted) listening tests, most participants judged the timbre of the converted speech to be close to that of the target speaker, which verifies the performance of our timbre-conversion integrated Mandarin speech synthesis system.
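As a rough illustration of the time-normalization and PCA steps described in the abstract (a minimal sketch, not the thesis implementation: the 8-way `np.array_split` stands in for the HMM-state segmentation, and the 13-dimensional DCC, the corpus size, and all variable names are assumptions made for this example):

```python
import numpy as np

def mean_segment_vectors(dcc_frames, n_segments=8):
    """Split a syllable's frame-wise DCC matrix into n_segments
    contiguous chunks (standing in for the HMM-state segmentation)
    and concatenate the mean DCC vector of each chunk."""
    chunks = np.array_split(dcc_frames, n_segments, axis=0)
    return np.concatenate([c.mean(axis=0) for c in chunks])

def pca_fit(X, n_components=4):
    """Plain PCA via SVD on mean-centered data.
    Returns the data mean and the top principal axes (rows)."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:n_components]

# Toy corpus: 50 syllables, each with a random number of frames
# of 13-dimensional DCC (all synthetic data).
rng = np.random.default_rng(0)
corpus = [rng.normal(size=(rng.integers(20, 60), 13)) for _ in range(50)]

# Each syllable -> one 8 * 13 = 104-dim time-normalized vector.
X = np.stack([mean_segment_vectors(s) for s in corpus])

mu, axes = pca_fit(X, n_components=4)
coeffs = (X - mu) @ axes.T   # a few PCA coefficients per syllable
recon = mu + coeffs @ axes   # approximate reconstruction
print(X.shape, coeffs.shape)  # (50, 104) (50, 4)
```

Projecting onto `axes` and reconstructing shows how a handful of coefficients can summarize each 104-dimensional syllable vector, mirroring the abstract's finding that a few principal components already give acceptable synthetic speech quality.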