
Author: Cheng-Lin Tsai (蔡承霖)
Thesis Title: A Timbre-Conversion Integrated Mandarin Speech Synthesis System (整合音色變換之國語語音合成系統)
Advisor: Hung-yan Gu (古鴻炎)
Committee Members: Hsin-Min Wang (王新民), Ming-Shing Yu (余明興), Huei-Wen Ferng (馮輝文)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2009
Graduation Academic Year: 98 (ROC calendar, 2009-2010)
Language: Chinese
Number of Pages: 77
Keywords (Chinese): 音色變換 (timbre conversion)
Keywords (English): voice conversion

Abstract:
This thesis proposes a method for Mandarin speech synthesis with integrated timbre conversion, and builds a system based on this method. For an input sentence, the system performs timbre conversion and synthesizes speech that carries the timbre of a target speaker. We first group the segmented syllable utterances according to their syllable-initial and syllable-final classes, and train a syllable HMM for each class. Next, we propose a time-normalization method based on HMM decoding and state segmentation: each syllable utterance is represented by the eight mean DCC (discrete cepstrum coefficient) vectors computed over its eight decoded segments, and within each initial or final class the collected mean vectors are used to perform principal component analysis (PCA). Taking a syllable's principal component coefficients and contextual parameters as input, two mapping ANNs convert the source speaker's initial HMM and final HMM, respectively, into HMMs possessing the timbre characteristics of the target speaker; the two converted HMMs are then combined into a single syllable HMM by a merging method developed in this work. Afterward, the DCC of each frame are generated with an interpolation method also developed here, and the spectral envelope computed from each frame's DCC, together with a pitch contour generated by an ANN, is used to control an HNM (harmonic-plus-noise model) to synthesize the speech signal. Listening tests on the number of principal components show that only a few principal components are needed to obtain good speech quality, and in the STC (source, target, converted) listening tests the great majority of participants judged the converted timbre to be close to the target speaker's. These results verify the effectiveness of the proposed timbre-conversion speech synthesis framework.
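As an illustration of the time-normalization and PCA steps described in the abstract, here is a minimal NumPy sketch. It is a simplification under stated assumptions: uniform eight-way segmentation stands in for the thesis's HMM-decoding-based segmentation, and the DCC order (DCC_DIM = 20), the number of syllables, and the random input data are hypothetical placeholders rather than values from the thesis.

```python
import numpy as np

NUM_SEGMENTS = 8   # segments per syllable, as in the abstract
DCC_DIM = 20       # assumed DCC order; the thesis's actual order may differ

def segment_means(frames: np.ndarray) -> np.ndarray:
    """Split a (num_frames, DCC_DIM) DCC sequence into 8 parts and return
    the concatenated per-segment mean vectors as one (8 * DCC_DIM,) vector.
    The thesis obtains the 8 segments by HMM decoding; uniform splitting
    is used here only to keep the sketch self-contained."""
    segments = np.array_split(frames, NUM_SEGMENTS, axis=0)
    return np.concatenate([seg.mean(axis=0) for seg in segments])

def fit_pca(vectors: np.ndarray, num_components: int):
    """PCA via SVD of the mean-centered data matrix; returns (mean, basis),
    where the rows of basis are the leading principal axes."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:num_components]

# Toy usage: random sequences stand in for one initial/final class's syllables.
rng = np.random.default_rng(0)
syllables = [rng.normal(size=(int(rng.integers(30, 80)), DCC_DIM))
             for _ in range(50)]
X = np.stack([segment_means(s) for s in syllables])   # shape (50, 160)

mean, basis = fit_pca(X, num_components=4)   # keep just a few components,
coeffs = (X - mean) @ basis.T                # as the listening tests suggest
X_hat = mean + coeffs @ basis                # low-dimensional reconstruction
print("reconstruction RMS error:", np.sqrt(((X - X_hat) ** 2).mean()))
```

In the actual system, each syllable's few PCA coefficients, together with its contextual parameters, would then be fed to the mapping ANNs that produce the target-speaker HMM parameters; the sketch only shows how a class's segment-mean vectors are reduced to those coefficients.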

Table of Contents:
Abstract (in Chinese)
Abstract (in English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation and Objectives
  1.2 Literature Review
    1.2.1 Acoustic Features
    1.2.2 Principal Component Analysis
    1.2.3 Mapping Mechanisms
  1.3 Research Approach
  1.4 Thesis Organization
Chapter 2 Corpus Preparation
  2.1 Recording and Labeling
    2.1.1 Recording
    2.1.2 Labeling
  2.2 Classification of Initials and Finals
Chapter 3 Discrete Cepstrum Coefficients and Hidden Markov Models
  3.1 Discrete Cepstrum Coefficients
    3.1.1 Spectral-Envelope Estimation Framework
    3.1.2 Spectral Parameter Computation
  3.2 Overview of Hidden Markov Models
    3.2.1 HMM Structure
    3.2.2 HMM Training
    3.2.3 Training of State-Duration Parameters
Chapter 4 Principal Component Analysis
  4.1 Overview of Principal Component Analysis
  4.2 PCA Procedure
  4.3 Extraction of Principal Component Coefficients
  4.4 PCA Approximation Error for the Source Speaker
Chapter 5 Artificial Neural Network Models
  5.1 Overview of Artificial Neural Networks
  5.2 Network Structure
  5.3 Network Input and Output Parameters
  5.4 Experiments on the Number of Units
Chapter 6 Voice Timbre Conversion and Synthesis
  6.1 Generation of Initial and Final HMMs
  6.2 Merging of Initial and Final HMMs
  6.3 Generation of Per-Frame Discrete Cepstrum Coefficients
  6.4 Speech Signal Synthesis
Chapter 7 System Implementation and Listening Tests
  7.1 System Implementation
    7.1.1 Principal Component Analysis
    7.1.2 Generation of Initial and Final HMMs
    7.1.3 Syllable Duration
    7.1.4 Pitch Contour
  7.2 Timbre Conversion and Synthesis Program
    7.2.1 Program Interface
    7.2.2 Program Testing
  7.3 Listening-Test Evaluation Method
  7.4 Listening Test 1: Number of Principal Components
  7.5 Listening Test 2: STC Test
Chapter 8 Conclusion
References
About the Author

References:
[1] M. Zhang, J. Tao, J. Nurminen, J. Tian, and X. Wang, "Phonetic anchor based state mapping for text-independent voice conversion," International Conference on Signal Processing, Beijing, China, Vol. 1, pp. 723-727, 2008.
[2] E. Helander, J. Nurminen, and M. Gabbouj, "LSF mapping for voice conversion with very small training sets," ICASSP, Las Vegas, U.S.A., pp. 4669-4672, 2008.
[3] 蔡松峯, "Improvement of the GMM-Based Voice Conversion Method" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009.
[4] T. En-Najjary, O. Rosec, and T. Chonavel, "A voice conversion method based on joint pitch and spectral envelope transformation," INTERSPEECH, Jeju Island, Korea, pp. 1225-1228, 2004.
[5] D. O'Shaughnessy, Speech Communications, 2nd ed., IEEE Press, 2000.
[6] C.-Y. Lin and J.-S. R. Jang, "New refinement schemes for voice conversion," IEEE International Conference on Multimedia & Expo, Baltimore, Maryland, pp. 725-728, 2003.
[7] 吳宗翰, "An Identity Recognition System Using Wireless Face Recognition on a PDA Platform" (in Chinese), master's thesis, Graduate Institute of Information and Electrical Engineering, Feng Chia University, 2008.
[8] 張文杰, "A Speaker Identification System Based on Model Adaptation" (in Chinese), master's thesis, Graduate Institute of Information and Electrical Engineering, National Central University, 2005.
[9] T. Toda, Y. Ohtani, and K. Shikano, "Eigenvoice conversion based on Gaussian mixture model," ICSLP, Pittsburgh, U.S.A., pp. 2446-2449, 2006.
[10] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," ICASSP, New York, U.S.A., pp. 655-658, 1988.
[11] A. Mouchtaris, Y. Agiomyrgiannakis, and Y. Stylianou, "Conditional vector quantization for voice conversion," ICASSP, Honolulu, Hawaii, Vol. 4, pp. 505-508, 2007.
[12] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.
[13] M. Chu, "Voice conversion with smoothed GMM and MAP adaptation," Proc. of EuroSpeech, Geneva, Switzerland, pp. 2413-2416, 2003.
[14] H. R. Pfitzinger, "DFW-based spectral smoothing for concatenative speech synthesis," Proc. of ICSLP, Jeju Island, Korea, Vol. 2, pp. 1397-1400, 2004.
[15] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," ICASSP, Taipei, Taiwan, pp. 3893-3896, 2009.
[16] 劉德賢, "A Study on Applying Markov Models and Voice Conversion to Emotional Speech Synthesis" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, 2005.
[17] Z. Shuang, F. Meng, and Y. Qin, "Voice conversion by combining frequency warping with unit selection," ICASSP, Las Vegas, U.S.A., pp. 4661-4664, 2008.
[18] 賴名彥, "A Mandarin Speech Synthesis System Combining HMM Spectrum Models and ANN Prosody Models" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009.
[19] 吳昌益, "A Study of Mandarin Speech Synthesis Using Spectrum Progression Models" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
[20] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991.
[21] 蔡哲彰, "A Study on Improving the Fluency of Synthesized Mandarin Singing" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009.
[22] 葉怡成, Applications and Implementations of Artificial Neural Network Models (in Chinese), 儒林圖書公司, Taipei, 2006.
[23] Wikipedia, "Artificial neural network," http://en.wikipedia.org/wiki/Artificial_neural_network.
[24] 鄭家豪, "A Study of Improved Back-Propagation Neural Networks for Reservoir Inflow Forecasting" (in Chinese), master's thesis, Department of Civil Engineering, National Taiwan University, 2009.
[25] 林正甫, "Mandarin Singing Voice Synthesis Using ANN Vibrato Parameter Models" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2008.
[26] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Duration modeling in HMM-based speech synthesis system," Proc. of ICSLP, Sydney, Australia, Vol. 2, pp. 29-32, 1998.
[27] T. Toda and K. Tokuda, "Speech parameter generation algorithm considering global variance for HMM-based speech synthesis," Proc. of ICASSP, Istanbul, Turkey, Vol. 3, pp. 1315-1318, 2000.
[28] Y. Stylianou, Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification, Ph.D. thesis, École Nationale Supérieure des Télécommunications, Paris, France, 1996.
[29] Y. Stylianou, "Modeling speech based on harmonic plus noise models," in Nonlinear Speech Modeling and Applications, G. Chollet et al., Eds., Springer-Verlag, Berlin, pp. 244-260, 2005.
[30] OpenCV, http://sourceforge.net/projects/opencvlibrary/.
[31] 曹亦岑, "Prosody Parameter Generation for Mandarin Speech Synthesis Using Small-Corpus Artificial Neural Networks" (in Chinese), master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2003.
