
Graduate Student: Yen-Hua Chen (陳彥樺)
Thesis Title: A Voice Conversion System Enhanced with Acoustic Language-model, Global Variance Matching, and Target Frame Selection (以聲學語言模型、全域變異數匹配及目標音框挑選作強化之語音轉換系統)
Advisor: Hung-Yan Gu (古鴻炎)
Committee Members: Hsin-Min Wang (王新民), Ming-Shing Yu (余明興), Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2016
Graduation Academic Year: 104 (ROC calendar, i.e., 2015-2016)
Language: Chinese
Number of Pages: 84
Keywords: voice conversion, acoustic language model, target frame selection, global variance, discrete cepstral coefficient, Gaussian mixture model, harmonic-plus-noise model
    This thesis studies combined voice conversion methods for strengthening GMM-based voice conversion. The combination comprises the processing steps of a PPM acoustic language model (ALM), target frame selection (TFS), and global variance (GV) matching, and we implemented two combined conversion methods, ALM+TFS+GV and ALM+GV+TFS. In the training stage, the mean vectors of the 128 Gaussian mixtures of a trained GMM are used to build a binary classification tree of quasi-phonemes, and this tree is then used in training the PPM acoustic language model. In the conversion stage, the input frames are segmented into quasi-phonemes according to the probabilities estimated by the ALM; each frame is then spectrally mapped with the single Gaussian mixture corresponding to its quasi-phoneme, after which TFS and GV matching are applied to alleviate the over-smoothing of spectral envelopes. TFS takes a converted frame's DCC (discrete cepstral coefficient) vector and selects, from the target speaker's training corpus, the frame whose DCC vector is nearest in distance as a replacement; GV matching maps the variance of a sequence of frames' DCC coefficients onto the target speaker's variance characteristics. The objective measurements show that the average DCC error distance of our methods is larger than that of the baseline method, but the variance ratio (VR) becomes higher, i.e., better. Furthermore, the subjective listening tests show that the proposed methods improve the signal quality of the converted speech, and that the timbre of the converted speech is quite close to the target speaker's.
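    For context, the single-Gaussian-mixture spectral mapping mentioned above commonly takes the conditional-Gaussian form below; this is reconstructed from the GMM-mapping literature the thesis builds on (Stylianou et al.; Kain), not quoted from the thesis itself. Here m is the mixture selected for the current frame, x is the source frame's DCC vector, and y-hat is the converted DCC vector:

        \hat{\mathbf{y}} = \boldsymbol{\mu}_y^{(m)} + \boldsymbol{\Sigma}_{yx}^{(m)} \bigl(\boldsymbol{\Sigma}_{xx}^{(m)}\bigr)^{-1} \bigl(\mathbf{x} - \boldsymbol{\mu}_x^{(m)}\bigr)

    where \boldsymbol{\mu}^{(m)} and \boldsymbol{\Sigma}^{(m)} are the mean and covariance of the m-th joint-space mixture, partitioned into source (x) and target (y) blocks.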


    In this thesis, a combined method for voice conversion is proposed to enhance the performance of GMM-based voice conversion systems. The combined method consists of three processing modules: a PPM acoustic language model (ALM), target frame selection (TFS), and global variance (GV) matching. We implemented two such voice conversion methods, ALM+TFS+GV and ALM+GV+TFS. In the training stage, the 128 mean vectors of the Gaussian mixtures of a trained GMM are used to build a binary classification tree of quasi-phoneme symbols; the tree is then used to train the ALM. In the conversion stage, the input voice frames are segmented according to the probabilities estimated by the ALM, and each frame's spectrum is mapped with the single Gaussian mixture corresponding to that frame. Afterward, the TFS and GV modules are executed to reduce the over-smoothing of the converted spectral envelopes. In TFS, the converted DCC (discrete cepstral coefficient) vector of an input frame is used to search the target speaker's training frames for the nearest one, and the DCC vector found replaces the converted one. GV matching adjusts the variance of a sequence of converted DCC vectors to match the variance of the target speaker's training DCC vectors.
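    To make the TFS and GV modules concrete, below is a minimal NumPy sketch. It is an illustration under our own assumptions (Euclidean distance for the nearest-frame search, per-dimension variance scaling around the utterance mean, and a weight knob for partial matching), not the thesis's implementation:

        import numpy as np

        def target_frame_selection(converted, target_frames):
            """TFS sketch: replace each converted DCC vector with the nearest
            DCC vector among the target speaker's training frames."""
            out = np.empty_like(converted)
            for t, c in enumerate(converted):                     # converted: (T, D)
                dist = np.linalg.norm(target_frames - c, axis=1)  # target_frames: (N, D)
                out[t] = target_frames[np.argmin(dist)]           # substitute nearest frame
            return out

        def gv_matching(converted, target_var, weight=1.0):
            """GV-matching sketch: rescale each DCC dimension's deviation from
            its utterance mean so the sequence variance moves toward the target
            speaker's training variance; weight=0 leaves the input unchanged,
            weight=1 matches the target variance exactly."""
            mean = converted.mean(axis=0)
            var = np.maximum(converted.var(axis=0), 1e-12)  # avoid division by zero
            scale = (target_var / var) ** (0.5 * weight)
            return mean + (converted - mean) * scale

    With weight=1.0, the converted sequence's per-dimension variance equals target_var; the thesis compares several GV weight settings in Experiment 1 (Section 6.3).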
    According to the results of the objective tests, the average DCC error of our methods is larger than that of the baseline method; however, the variance ratio (VR), an index of signal quality, shows that our methods are better. In addition, the results of the perception tests indicate that the speech converted by our methods has higher signal quality, and a timbre more similar to the target speaker's, than that of the baseline method.
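    The variance ratio is defined in Section 6.2 of the thesis; as an illustration only, one plausible formulation (our assumption, not necessarily the thesis's exact definition) averages, over DCC dimensions, the converted frames' variance divided by the target training frames' variance, so that values well below 1 indicate over-smoothing:

        import numpy as np

        def variance_ratio(converted, target_frames):
            """Hypothetical VR: mean over DCC dimensions of
            Var(converted) / Var(target); closer to 1 is better."""
            return float(np.mean(converted.var(axis=0) / target_frames.var(axis=0)))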

    Abstract (in Chinese)
    Abstract (in English)
    Acknowledgements
    Table of Contents
    Index of Figures and Tables
    Chapter 1  Introduction
        1.1  Motivation
        1.2  Literature Review
        1.3  Research Methods
            1.3.1  Training Procedure of the Voice Conversion System
            1.3.2  Conversion Procedure of the Voice Conversion System
        1.4  Thesis Organization
    Chapter 2  Corpus Preparation and Spectral Feature Parameters
        2.1  Corpus Recording
        2.2  Phonetic Labeling and Segmentation
        2.3  Discrete Cepstral Coefficient Estimation
        2.4  DTW Frame Alignment
    Chapter 3  Training and Testing of the Acoustic Language Model
        3.1  Building the Quasi-phoneme Classification Tree
        3.2  Vector-Quantization Coding into Quasi-phoneme Symbols
        3.3  Training the PPM Acoustic Language Model
        3.4  ALM-based Quasi-phoneme Selection
            3.4.1  Quasi-phoneme Symbol Selection
            3.4.2  Finding the Optimal Quasi-phoneme Sequence by Dynamic Programming
        3.5  Testing the Acoustic Language Model
            3.5.1  Perplexity Evaluation of the PPM Acoustic Language Model
            3.5.2  Accuracy of Quasi-phoneme Selection
    Chapter 4  Spectral Coefficient Mapping
        4.1  Gaussian Mixture Model (GMM)
        4.2  Single-Gaussian-Mixture Selection and Mapping
            4.2.1  Single-Gaussian-Mixture Selection
            4.2.2  Single-Gaussian-Mixture Mapping
    Chapter 5  Methods for Alleviating Spectral Over-smoothing
        5.1  Target Frame Selection
        5.2  Global Variance Adjustment
    Chapter 6  Voice Conversion Experiments
        6.1  Speaker Pairing
        6.2  Measurement of Average DCC Error and Variance Ratio
        6.3  Experiment 1: Comparison of Global Variance Weight Values
        6.4  Experiment 2: Comparison of Different Conversion Methods
    Chapter 7  System Implementation and Listening Tests
        7.1  Pitch Conversion
        7.2  HNM Synthesis
        7.3  Listening Test 1: Comparison of the Proposed Methods
        7.4  Listening Test 2: Comparison with Previous Methods
    Chapter 8  Conclusion
    References

