| Field | Value |
|---|---|
| Graduate Student | 陳彥樺 Yen-Hua Chen |
| Thesis Title | 以聲學語言模型、全域變異數匹配及目標音框挑選作強化之語音轉換系統 (A Voice Conversion System Enhanced with Acoustic Language-model, Global Variance Matching, and Target Frame Selection) |
| Advisor | 古鴻炎 Hung-Yan Gu |
| Committee Members | 王新民 Hsin-Min Wang, 余明興 Ming-Shing Yu, 林伯慎 Bor-Shen Lin |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Publication Year | 2016 |
| Graduation Academic Year | 104 |
| Language | Chinese |
| Pages | 84 |
| Chinese Keywords | 語音轉換、聲學語言模型、目標音框挑選、全域變異數、離散倒頻譜係數、高斯混合模型、諧波加雜音模型 |
| English Keywords | voice conversion, acoustic language-model, target frame selection, global variance, discrete cepstral coefficient, Gaussian mixture model, harmonic-plus-noise model |
| Views / Downloads | 333 / 11 |
本論文研究了組合式之語音轉換方法來強化以GMM為基礎之語音轉換功能,這種組合式方法包含了PPM聲學語言模型(ALM)、目標音框挑選(TFS)與全域變異數(GV)匹配等處理步驟,我們實作了兩個組合式語音轉換方法,分別是ALM+TFS+GV法與ALM+GV+TFS法。在訓練階段,我們使用訓練出的GMM之128個高斯混合的平均向量來建立近似音素之二元分類樹,再用此分類樹來訓練PPM聲學語言模型。在轉換階段,我們依據ALM估計的機率去對輸入音框作近似音素的分段,然後各音框依其對應的近似音素去作單一高斯混合之頻譜對映,接著再作TFS與GV匹配等處理,以便改善頻譜包絡過度平滑的問題。TFS依轉換後音框的DCC係數,到目標語者訓練語料中挑選出距離最接近的音框DCC來做取代;GV匹配則是把一序列音框的DCC係數之變異數特性匹配到目標語者的變異數特性。由客觀量測實驗的結果發現,我們轉換方法的平均DCC誤差距離會比基本轉換方法的大,但變異數比值(VR)則會變高變好。此外,從主觀聽測實驗的結果可知,本論文所提出的語音轉換方法能夠提升轉換後語音的信號品質,並且轉換出語音的音色也相當接近目標語者的。
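As a concrete illustration of the target frame selection (TFS) step described in the abstract, the following is a minimal numpy sketch: each converted DCC vector is replaced by the nearest DCC vector, in Euclidean distance, among the target speaker's training frames. The function name and the array layout (one frame per row) are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def target_frame_selection(converted_dcc, target_dcc):
    """Replace each converted DCC vector with the nearest
    target-speaker training DCC vector (Euclidean distance)."""
    # Pairwise distances between frames: shape (n_converted, n_target).
    diff = converted_dcc[:, None, :] - target_dcc[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Index of the nearest target frame for every converted frame.
    nearest = dist.argmin(axis=1)
    return target_dcc[nearest]
```

For real corpora an exhaustive search over all target frames is costly; a k-d tree, or restricting the search to frames of the same quasi-phonetic class, would be a natural refinement.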
In this thesis, a combined method is proposed to enhance the performance of GMM-based voice conversion systems. The method comprises three processing modules: a PPM acoustic language model (ALM), target frame selection (TFS), and global variance (GV) matching. We implemented two combined conversion methods, ALM+TFS+GV and ALM+GV+TFS. In the training stage, the 128 mean vectors of the Gaussian mixtures in a trained GMM are used to build a binary classification tree of quasi-phonetic symbols, and this tree is then used to train the PPM ALM. In the conversion stage, input voice frames are first segmented into quasi-phonetic units according to the probabilities estimated by the ALM; each frame's spectrum is then mapped with the single Gaussian mixture corresponding to its quasi-phonetic unit. Afterward, TFS and GV matching are applied to alleviate the over-smoothing of converted spectral envelopes. In TFS, the converted DCC (discrete cepstral coefficient) vector of an input frame is used to search the target speaker's training frames for the nearest frame, whose DCC vector then replaces the converted one. GV matching adjusts the variance of a sequence of converted DCC vectors to match the variance of the target speaker's training DCC vectors.
According to the objective tests, the average DCC error of our methods is larger than that of the baseline method; however, the variance ratio (VR), an index of signal quality, is higher and thus better. In addition, the perceptual listening tests show that speech converted by our methods obtains higher signal quality and higher timbre similarity to the target speaker than the baseline method.
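The GV matching step can be sketched as follows: the per-dimension variance of a converted DCC sequence is rescaled around its mean so that it equals the target speaker's global variance. This is a minimal numpy sketch under the assumption that GV matching is a per-dimension linear rescaling; the function and variable names are illustrative.

```python
import numpy as np

def gv_match(converted_dcc, target_gv):
    """Rescale each DCC dimension of a converted utterance so its
    variance matches the target speaker's global variance (GV)."""
    mean = converted_dcc.mean(axis=0)
    var = converted_dcc.var(axis=0)
    # Per-dimension scale factor; guard against zero variance.
    scale = np.sqrt(target_gv / np.maximum(var, 1e-12))
    return (converted_dcc - mean) * scale + mean
```

Because the rescaling is centered on the utterance mean, the average spectral shape is preserved while the dynamic range of each cepstral dimension is stretched to the target's, which counteracts the variance shrinkage caused by statistical averaging in GMM mapping.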
[1] 蔡松, Improvements to the GMM-Based Voice Conversion Method, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009.
[2] 王讚緯, A Voice Conversion System Using Histogram Equalization and Target Frame Selection, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2014.
[3] 張家維, A Voice Conversion Method Using Principal-Component Vector Projection and Minimum-Mean-Square-Error Mapping, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2012.
[4] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," in 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 1, San Francisco, CA, USA, 23-26 Mar. 1992, pp. 145-148.
[5] D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion based on bilinear frequency warping plus amplitude scaling," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 3, pp. 556-566, 2013.
[6] X. Tian, Z. Wu, S. W. Lee, and E. S. Chng, "Correlation-based frequency warping for voice conversion," in 2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore, 12-14 Sept. 2014, pp. 211-215.
[7] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Communication, vol. 16, no. 2, pp. 207-216, 1995.
[8] F. L. Xie, Y. Qian, Y. Fan, F. K. Soong, and H. Li, "Sequence error (SE) minimization training of neural network for voice conversion," in Proc. Interspeech, 2014, pp. 2283-2287.
[9] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.
[10] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.
[11] T. Dutoit, A. Holzapfel, M. Jottrand, A. Moinet, J. Perez, and Y. Stylianou, "Toward a voice conversion system based on frame selection," in 2007 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, Honolulu, HI, 15-20 Apr. 2007, pp. 513-516.
[12] H. Y. Gu and S. F. Tsai, "A voice conversion method combining segmental GMM mapping with target frame selection," Journal of Information Science and Engineering, vol. 31, no. 2, pp. 609-626, 2015.
[13] 蔡仲明, Language Identification of Mandarin, Min-Nan, and Hakka Based on GMM and PPM Models, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
[14] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK version 3.2.1), Cambridge University Engineering Department, 2002.
[15] K. Sjolander and J. Beskow, WaveSurfer, Centre for Speech Technology, KTH. Available: http://www.speech.kth.se/wavesurfer/.
[16] 吳昌益, A Study on Mandarin Speech Synthesis Using a Spectrum-Progression Model, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
[17] K. Sayood, Introduction to Data Compression, 2nd ed., San Francisco, CA: Morgan Kaufmann Publishers, 2000.
[18] W. J. Teahan, "Probability estimation for PPM," in New Zealand Computer Science Research Students' Conference (NZCSRSC'95), Apr. 1995.
[19] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, NJ, USA: Prentice Hall, 1993.
[20] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, vol. 3, no. 1, pp. 1-27, 1974.
[21] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195-239, 1984.
[22] A. Kain, High Resolution Voice Transformation, Ph.D. dissertation, Oregon Health & Science University, 2001.
[23] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.
[24] 洪尉翔, A Method for Improving the Signal Quality of Synthetic Speech Using MGE-Trained HMM and Global Variance Matching, Master's thesis, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2015.
[25] E. Godoy, O. Rosec, and T. Chonavel, "Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 1313-1323, 2012.
[26] Y. Stylianou, Harmonic plus Noise Models for Speech, Combined with Statistical Methods, for Speech and Speaker Modification, Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications, 1996.