
Graduate Student: Wei-hsiang Hong (洪尉翔)
Thesis Title: Synthetic Speech Signal-quality Improving Methods Using Minimum-Generation-Error Trained HMM and Global Variance Matching (使用MGE訓練之HMM模型及全域變異數匹配之合成語音信號音質改進方法)
Advisor: Hung-yan Gu (古鴻炎)
Oral Examination Committee: Hsin-min Wang (王新民), Ming-shing Yu (余明興), Kuo-liang Chung (鍾國亮)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2015
Graduation Academic Year: 103
Language: Chinese
Number of Pages: 114
Keywords: speech synthesis, signal quality, hidden Markov model, minimum generation error, formant enhancement, global variance
Access Count: Views: 606 / Downloads: 3
    This thesis adopts a new HMM structure, the half-segment HMM, which substantially improves the fluency of the synthetic speech even when only a small amount of training data is available. In addition, we propose a method that combines MGE-criterion HMM training with formant enhancement or global variance (GV) adjustment to alleviate the spectral over-smoothing problem and thereby improve the signal quality of the synthetic speech. For the implementation of MGE-criterion HMM training, we experimented with a formula-simplification method and a dimension-independence method, and found that the dimension-independence method achieves better generation-error performance. Furthermore, MGE-criterion HMM training involves three implementation factors; we compared the objective performance (average MFCC distance and variance ratio) obtained under different combinations of these factors, and found that keeping the covariance matrices unchanged and building the initial HMMs with the segmental K-means method is the better choice. When the average MFCC distance is measured, the ensemble-training flow is the better choice, whereas when the variance ratio is measured, the incremental (per-sentence) training flow is better. For formant enhancement, observing and comparing spectral-envelope curves shows that the geometric-series method proposed here outperforms the earlier constant-series method. When GV adjustment is applied, the weight value must be set appropriately to prevent the synthesized speech from exhibiting sudden amplitude increases or even clicks. In the subjective listening tests, the MOS scores of the MGE-related synthesis methods, for both male and female synthetic speech, indicate that the incremental (per-sentence) training flow should be used for MGE-criterion HMM training in order to obtain better MOS scores. GV adjustment and formant enhancement both generally help to improve the signal quality of the synthetic speech, although their synthesized speech sometimes contains strange sounds or clicks, which lowers the MOS scores.
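    For context, the following is a minimal sketch of the generation-error criterion that MGE training minimizes, written in the generic notation of the standard MGE formulation; the symbols c, \hat{c}, W, \mu_q, \Sigma_q, and \lambda are illustrative and are not taken from the thesis:

        % Generation error of an HMM set \lambda over N training vectors:
        %   c_n       : natural (target) spectral-parameter vector
        %   \hat{c}_n : vector generated from \lambda by ML parameter generation
        E(\lambda) = \frac{1}{N} \sum_{n=1}^{N} \bigl\| c_n - \hat{c}_n(\lambda) \bigr\|^2 ,
        \qquad
        \hat{c} = \bigl( W^{\top} \Sigma_q^{-1} W \bigr)^{-1} W^{\top} \Sigma_q^{-1} \mu_q

    Here W is the window matrix that appends dynamic features, \mu_q and \Sigma_q are the stacked mean and covariance of the state sequence q, and \lambda is updated iteratively (e.g., by gradient descent) so as to reduce E(\lambda). The thesis's own formula-simplification and dimension-independence implementations may differ in detail from this generic form.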


    In this thesis, we adopt a new HMM (hidden Markov model) structure, i.e., the half-segment HMM (half context-dependent and half in size), with which the fluency of the synthetic speech is noticeably improved even when only a limited number of training sentences is available. In addition, we study a method that combines minimum-generation-error (MGE) based HMM training with formant enhancement or global variance matching to alleviate the problem of spectral over-smoothing and thus improve the signal quality of the synthetic speech. When implementing MGE-based HMM training, we program two different procedures, called the formula-simplification procedure and the dimension-independence procedure. According to the measured generation errors, the dimension-independence procedure is found to be the better one. In practice, MGE-based HMM training involves three implementation factors that need to be considered, so we compare different combinations of these factors in terms of objective measures (average MFCC distance and variance ratio). It is found that keeping the covariance matrices unchanged and using initial HMMs trained with the segmental K-means method is the better choice. According to the measured average MFCC distances, the ensemble-training flow is better than the incremental (per-sentence) training flow studied here; nevertheless, when the measured variance ratios are considered, the incremental training flow is the better one. As to formant enhancement, by comparing the spectral envelopes obtained with different methods, we found that the geometric-series method proposed here is better than the constant-series method. As to global variance matching, an appropriate weight value must be set to prevent abrupt amplitude changes and clicks from occurring. According to the results of the listening tests, among the speech-synthesis methods using MGE-trained HMMs, the HMMs trained with the incremental-training flow obtain better MOS scores than those trained with the ensemble-training flow. The results also show that global variance matching and formant enhancement can generally improve the signal quality of the synthetic speech; nevertheless, clicks or harsh noises may sometimes be heard in the synthesized speech, which causes the MOS scores to be reduced.
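    To make the global-variance matching step more concrete, below is a minimal Python sketch of one common form of per-dimension variance scaling toward a target global variance, controlled by an interpolation weight w. The function name gv_match, its arguments, and this exact formulation are illustrative assumptions, not code from the thesis.

        import numpy as np

        def gv_match(traj, target_gv, w=1.0):
            """Scale each coefficient dimension of a generated trajectory so that
            its variance moves toward a target global variance.

            traj      : (T, D) array, generated spectral-parameter trajectory
            target_gv : (D,) array, global variance estimated from natural speech
            w         : weight in [0, 1]; w = 0 leaves the trajectory unchanged,
                        w = 1 matches the target variance exactly
            """
            mean = traj.mean(axis=0)                    # per-dimension mean
            gen_gv = traj.var(axis=0) + 1e-12           # variance of the generated trajectory
            full_scale = np.sqrt(target_gv / gen_gv)    # scaling that matches the target exactly
            scale = 1.0 + w * (full_scale - 1.0)        # interpolate between no change and full match
            return (traj - mean) * scale + mean         # rescale each dimension around its mean

        # Illustrative usage: matched = gv_match(generated_mfcc, natural_gv, w=0.6)

    In such a scheme, setting w too large enlarges every frame's deviation from the mean, which is consistent with the abstract's remark that an inappropriate weight can cause sudden amplitude increases or clicks in the synthesized waveform.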

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures and Tables
    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Literature Review
        1.2.1  Improvements to the HMM Model
        1.2.2  Alleviation of Spectral Over-Smoothing
        1.2.3  Signal Synthesis Methods
      1.3  Research Methods
      1.4  Thesis Organization
    Chapter 2  Corpus Preparation and Spectral-Coefficient Extraction
      2.1  Corpus Recording
      2.2  Phonetic Labeling and Segmentation
      2.3  STRAIGHT Analysis
      2.4  Spectral-Coefficient Extraction
    Chapter 3  HMM Spectral-Model Training
      3.1  Hidden Markov Models
      3.2  Corpus Classification
        3.2.1  Classification of Initials and Finals
        3.2.2  Contextual Information
      3.3  A New HMM Structure: the Half-Segment HMM
      3.4  HMM Model Training
      3.5  Training of State-Duration Parameters
    Chapter 4  HMM Training under the Minimum-Generation-Error Criterion
      4.1  Minimum-Generation-Error Training
      4.2  Generation of Frame Spectral Coefficients
      4.3  HMM Parameter Optimization
      4.4  MGE Training Flows
      4.5  MGE Training Implementations
        4.5.1  Formula-Simplification Implementation
        4.5.2  Dimension-Independence Implementation
      4.6  Comparison of the Formula-Simplification and Dimension-Independence Implementations
      4.7  Other Evaluation Methods for the Experimental Factors
    Chapter 5  Evaluation of MGE-Trained HMM Models
      5.1  Evaluation Methods
        5.1.1  Average MFCC Distance
        5.1.2  Variance Ratio
      5.2  Experimental Settings
      5.3  Comparison of Fixed vs. Continuously Updated Covariance Matrices
      5.4  Comparison of Initial HMMs from Uniform Segmentation and Segmental K-Means
      5.5  Comparison of the Ensemble and Incremental (Per-Sentence) Training Flows
      5.6  Comparison of MGE-Based HMM Training with Other HMM Training Methods
    Chapter 6  Alleviation of Spectral Over-Smoothing
      6.1  Formant Enhancement
        6.1.1  Constant-Series Method
        6.1.2  Geometric-Series Method
      6.2  Global Variance Adjustment
      6.3  Objective Measurement of Speech Quality
        6.3.1  Measurements for Formant Enhancement
        6.3.2  Measurements for Global Variance Adjustment
    Chapter 7  Speech-Signal Synthesis and Subjective Listening Tests
      7.1  HMM Model Selection
      7.2  HNM Signal Synthesis
      7.3  Subjective Listening Tests
        7.3.1  Listening Tests on Male Synthetic Speech
        7.3.2  Listening Tests on Female Synthetic Speech
      7.4  Examples of Synthetic Speech
    Chapter 8  Conclusion
    References
