
Author: 黃崇哲 (Chung-Che Huang)
Thesis Title: 用於名人語音合成之PCA與ANN為基礎的音色轉換方法 (A PCA and ANN Based Timbre Conversion Method for Synthesizing a Famous Person's Speech)
Advisor: 古鴻炎 (Hung-yan Gu)
Committee: 陳秋華 (Chyou-Hwa Chen), 王新民 (Hsin-Min Wang), 林伯慎 (Bor-shen Lin)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Thesis Publication Year: 2011
Graduation Academic Year: 99 (ROC calendar)
Language: Chinese
Pages: 88
Keywords (in Chinese): 音色轉換、非平行語料、主成分分析、類神經網路、隱藏式馬可夫模型
Keywords (in other languages): timbre conversion, non-parallel corpus, PCA, ANN, HMM
Reference times: Clicks: 268, Downloads: 4

In the situation where only a small, non-parallel corpus is available, this thesis attempts to solve the problems faced by a timbre conversion method that combines PCA and ANN, and then uses this method to build a Mandarin speech synthesis system that can synthesize speech with a famous person's timbre. We design a classification scheme for syllable initials and for syllable finals, respectively, to reduce the number of classes, and then train a shared HMM structure from the syllable utterances of each class; this structure is used to extract a fixed number of DCC spectral feature vectors from each syllable, and we expect it to improve on the common left-to-right HMM structure. To train the ANN that maps the principal component coefficients of source speech to those of target speech under the non-parallel-corpus condition, we study a context classification scheme that mixes rough and precise classification in order to find, for each target syllable, source recordings with the most similar contexts; then, for each initial and final class, PCA analysis is performed and a dedicated ANN mapping is trained. In the synthesis stage, the PCA and ANN based timbre conversion function is integrated into a Mandarin speech synthesis system developed previously, and timbre-converted speech is synthesized for listening tests. The results show that most listeners judged the timbre of the synthesized speech to be somewhat similar to the target famous person, but the signal quality and prosodic characteristics need further improvement.
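The PCA step in this pipeline reduces each syllable's DCC spectral vectors to a small set of principal component coefficients, which are what the mapping network later operates on. A minimal sketch of this projection in Python with NumPy (the function names and the 12-dimensional DCC vectors in the usage note are illustrative assumptions, not the thesis's actual configuration):

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA on the rows of X (one DCC vector per row).

    Returns the mean vector and the top-k principal axes,
    sorted by descending explained variance.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigen-decomposition of the sample covariance matrix;
    # columns of vecs are orthonormal principal axes.
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:k]
    return mean, vecs[:, order]

def pca_project(X, mean, axes):
    """Project DCC vectors onto the principal axes -> PCA coefficients."""
    return (X - mean) @ axes

def pca_reconstruct(C, mean, axes):
    """Map PCA coefficients back into the DCC spectral space."""
    return C @ axes.T + mean
```

In the conversion pipeline, coefficients obtained by `pca_project` from source-speaker DCC vectors are mapped by the ANN into the target speaker's coefficient space, and `pca_reconstruct` then recovers target-style DCC vectors for synthesis.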


In this thesis, we try to solve the problems encountered when applying the PCA (principal component analysis) and ANN (artificial neural network) based timbre conversion method in the situation where only a small, non-parallel corpus is available. This method is then adopted to construct a Mandarin speech synthesis system that synthesizes speech with a famous person's timbre. We design dedicated classification schemes for syllable initials and finals, respectively, to decrease the number of categories. For the syllable signals of each category, an HMM with a state-sharing structure is trained in order to obtain a fixed number of DCC (discrete cepstral coefficient) vectors for each syllable signal; this state-sharing structure is intended to improve on the common left-to-right HMM structure. An ANN is adopted to map the PCA coefficients of a source syllable to the PCA coefficients of a target syllable. To train this ANN without parallel data, we propose a method that classifies the context in which a syllable is pronounced, combining precise classification and rough classification. Based on the result of context classification, we can find, for each target syllable, the source syllables whose contexts are most similar. For each category of syllable initials and finals, a separate ANN for mapping PCA coefficients is trained after the DCC vectors are analyzed with PCA. In the synthesis stage, we integrate the PCA and ANN based timbre conversion function into the Mandarin speech synthesis system developed previously by others. Using this system, speech signals with converted timbre are synthesized and used to conduct listening tests. The results show that the timbre of the synthesized speech is somewhat similar to that of the target famous person; however, the signal quality and the prosodic characteristics still need to be improved.
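The coefficient-mapping ANN described above can be sketched as a one-hidden-layer MLP trained on context-matched (source, target) coefficient pairs. The following is a generic NumPy sketch; the layer sizes, tanh activation, and learning rate are assumptions, since the thesis's actual network configuration is not given here:

```python
import numpy as np

class MappingMLP:
    """One-hidden-layer MLP mapping source PCA coefficients to
    target PCA coefficients (illustrative sketch, not the thesis's
    exact network)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)  # hidden activations
        return self.h @ self.W2 + self.b2        # linear output layer

    def train_step(self, X, Y, lr=0.01):
        """One batch gradient-descent step on mean-squared error;
        returns the loss before the update."""
        err = self.forward(X) - Y
        n = len(X)
        gW2 = self.h.T @ err / n
        gb2 = err.mean(axis=0)
        dh = (err @ self.W2.T) * (1.0 - self.h ** 2)  # tanh derivative
        gW1 = X.T @ dh / n
        gb1 = dh.mean(axis=0)
        self.W2 -= lr * gW2; self.b2 -= lr * gb2
        self.W1 -= lr * gW1; self.b1 -= lr * gb1
        return float((err ** 2).mean())
```

At conversion time, the trained network's `forward` output for a source syllable's PCA coefficients serves as the estimate of the target speaker's coefficients, which are then turned back into spectral vectors for synthesis.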

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivation and Goals
  1.2 Literature Review
  1.3 Research Methods
  1.4 Thesis Organization
Chapter 2 Corpus Preparation
  2.1 Corpus Collection and Processing
    2.1.1 Recording
    2.1.2 Phonetic Labeling
  2.2 Pronunciation Quality Grading
  2.3 Classification of Initials and Finals
    2.3.1 101 Initial Classes, 135 Final Classes
    2.3.2 65 Initial Classes, 72 Final Classes
    2.3.3 46 Initial Classes, 37 Final Classes
Chapter 3 Hidden Markov Models
  3.1 Introduction to HMM
  3.2 Training of Left-to-Right HMMs
    3.2.1 Viterbi Search
    3.2.2 State Frame Collection
    3.2.3 Computing Gaussian Distributions and Transition Probabilities; Convergence Checking
  3.3 HMM Structures with State Sharing for Initials and Finals
  3.4 Training of State-Sharing HMMs
Chapter 4 Principal Component Analysis and Non-Parallel Corpus Pairing
  4.1 PCA: Basic Method
  4.2 PCA: Reducing Computation
  4.3 Classification of Context Components
  4.4 Context Matching
Chapter 5 Artificial Neural Network Mapping
  5.1 ANN Structure
  5.2 ANN Input and Output Parameters
  5.3 Experiments on the Number of Units
Chapter 6 Speech Transformation and Synthesis
  6.1 MLP Mapping
  6.2 Generating Syllable DCC Coefficients
  6.3 Speech Signal Synthesis
  6.4 Program Interface
Chapter 7 Listening Tests
  7.1 Evaluation Method
  7.2 Listening Test 1
  7.3 Listening Test 2
  7.4 Listening Test 3
  7.5 Listening Test 4
Chapter 8 Conclusion
References
Appendix 1 Precise (P) Classification of Initials
Appendix 2 Precise (P) Classification of Finals
Appendix 3 Rough (R) Classification of Initials
Appendix 4 Rough (R) Classification of Finals
About the Author

