
Graduate Student: 張世穎 (Shih-Ying Chang)
Thesis Title: 結合HTS頻譜模型與ANN韻律模型之國語語音合成系統
(A Mandarin Speech Synthesis System Combining HTS Spectrum Models and ANN Prosody Models)
Advisor: 古鴻炎 (Hung-Yan Gu)
Committee Members: 江振宇, 余明興, 林伯慎
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of Publication: 2013
Graduation Academic Year: 101 (2012-2013)
Language: Chinese
Number of Pages: 75
Keywords (Chinese): 國語語音合成系統 (Mandarin speech synthesis system), 類神經網路韻律模型 (ANN prosody model), 隱藏式馬可夫模型 (hidden Markov model), 離散倒頻譜係數 (discrete cepstrum coefficients)
Keywords (English): STRAIGHT, ANN Prosody Model, Mandarin Speech Synthesis System, DCC
  • This thesis studies a system framework for Mandarin speech synthesis that combines HTS spectrum models with ANN prosody models. In the training stage, for spectral analysis we use STRAIGHT to obtain more accurate spectral envelopes and convert each frame's spectral envelope into DCC coefficients; HTS is then used to train the spectral HMM models and decision trees. In the synthesis stage, we developed our own programs that use the HTS-trained spectral HMM models and decision trees to generate the DCC coefficients of each frame, and ANN prosody models to generate the syllable-duration and pitch-contour parameters; both sets of parameters are then fed to the HNM signal synthesis module to synthesize the speech signal. In addition, we apply a formant enhancement method that alleviates the over-smoothing of the spectral envelopes, and we adopt different interpolation schemes for intensity and for the pitch contour, which further improves naturalness. After the system was completed, we evaluated the original HTS system and our system with one objective parameter measurement and two listening tests; the results show that the system built in this thesis performs noticeably better in both the naturalness and the clarity of the synthesized speech.
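
    The formant enhancement mentioned above targets a well-known weakness of HMM-based synthesis: statistical averaging during training flattens the peaks of the generated spectral envelopes. Below is a minimal, runnable sketch of one common cepstral-domain formant-sharpening idea, scaling the higher cepstral coefficients by a factor beta > 1 so that the log envelope's excursions around its mean level are magnified; the function names and the beta value here are illustrative assumptions, not the thesis's actual implementation.

        import numpy as np

        def log_envelope(cep, n_points=512):
            # Log spectral envelope from cepstral coefficients:
            #   log S(w) = c0 + 2 * sum_k c_k * cos(k*w),  w in [0, pi]
            w = np.linspace(0.0, np.pi, n_points)
            k = np.arange(1, len(cep))
            return cep[0] + 2.0 * np.cos(np.outer(w, k)) @ cep[1:]

        def enhance_formants(cep, beta=1.2):
            # Scaling c_1..c_p by beta > 1 multiplies the log envelope's
            # deviation from its mean level (c0) by beta, deepening peaks
            # and valleys, i.e. sharpening the formant structure.
            out = np.asarray(cep, dtype=float).copy()
            out[1:] *= beta
            return out

        # Toy demonstration with a made-up cepstrum vector
        cep = np.array([1.0, 0.8, -0.3, 0.25, -0.1, 0.05])
        before = log_envelope(cep)
        after = log_envelope(enhance_formants(cep))
        print("peak-to-valley range before: %.2f" % (before.max() - before.min()))
        print("peak-to-valley range after : %.2f" % (after.max() - after.min()))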


    ABSTRACT
    In this thesis, a system framework for Mandarin speech synthesis is studied that combines HTS-trained (HMM-based speech synthesis system) spectrum models with ANN (artificial neural network) prosody models. In the training stage, STRAIGHT is used to estimate accurate spectral envelopes from the training speech frames, and the spectral envelope of each frame is converted to DCC (discrete cepstrum coefficients); HMM (hidden Markov model) spectrum models and decision trees are then trained with HTS. In the synthesis stage, our own programs use the HTS-trained HMM spectrum models and decision trees to generate DCC for each frame, while ANN prosody models generate the syllable-duration and pitch-contour parameters. The prosodic parameters and DCC are then sent to the HNM (harmonic plus noise model) signal synthesis module to synthesize the speech signal. In addition, we adopt a formant enhancement method to alleviate the spectral over-smoothing problem, and we apply interpolation methods to adjust the intensities and pitch heights of the frames around each syllable boundary, which significantly improves naturalness. After the system was built, we compared our system with the original HTS system through an objective measurement of the generated DCC and two types of listening tests. The results show that our system performs better in both the naturalness and the clarity of the synthesized speech.
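
    Part of the reported naturalness gain comes from smoothing intensity and pitch across the frames around each syllable boundary. Below is a minimal sketch assuming a simple symmetric linear re-interpolation over a fixed window of frames; the thesis applies different (and separate) interpolation schemes to intensity and to the pitch contour, so this is only an illustration of the kind of cross-boundary smoothing involved.

        import numpy as np

        def smooth_boundary(track, boundary, width=4):
            # Replace the `width` frames on each side of a syllable boundary
            # with a straight line between the window's endpoint values, so
            # that a per-frame track (intensity in dB, or log-F0) ramps
            # smoothly across the boundary instead of jumping.
            lo = max(0, boundary - width)
            hi = min(len(track) - 1, boundary + width)
            out = np.asarray(track, dtype=float).copy()
            out[lo:hi + 1] = np.linspace(out[lo], out[hi], hi - lo + 1)
            return out

        # Toy example: an intensity track with a 6 dB jump at frame 10
        track = np.concatenate([np.full(10, 60.0), np.full(10, 66.0)])
        print(np.round(smooth_boundary(track, boundary=10), 1))

    In this toy example, the 6 dB step at the boundary becomes a gradual ramp spread over nine frames.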

    Table of Contents
    List of Figures and Tables
    Chapter 1  Introduction
      1.1  Motivation and Purpose
      1.2  Literature Review
        1.2.1  System Architectures
        1.2.2  Spectrum Evolution
        1.2.3  Review of Speech Signal Synthesis Methods
      1.3  Research Methods and Contributions
      1.4  Thesis Organization
    Chapter 2  Corpus Preparation
      2.1  Recording and Corpus
      2.2  Phonetic Labeling
      2.3  STRAIGHT Analysis
      2.4  Spectral Coefficient Extraction
        2.4.1  Spectral Envelope Estimation Framework
        2.4.2  Spectral Coefficient Computation
    Chapter 3  HMM Spectrum Model Training
      3.1  HTS Speech Synthesis
      3.2  Introduction to Hidden Markov Models
      3.3  HTS Corpus and File Preparation
        3.3.1  Coefficient Settings and File Configuration
        3.3.2  Mandarin Speech Unit Definition
      3.4  Feature Coefficient Extraction
        3.4.1  MGC Coefficients
        3.4.2  DCC Coefficients
      3.5  Context-Independent HMM Models
      3.6  Context-Dependent HMM Models
      3.7  HMM Tree-Based Clustering and Decision Trees
        3.7.1  Question Set
        3.7.2  Decision Trees
      3.8  HMM Training and Second-Pass Clustering
      3.9  HTS Training Results
    Chapter 4  Speech Synthesis System Construction
      4.1  HMM Model Selection
      4.2  Prosodic Parameter Generation
        4.2.1  Duration
        4.2.2  Pitch Contour
      4.3  HMM State Duration
      4.4  Frame DCC Coefficient Generation
        4.4.1  Linear Interpolation
        4.4.2  Weighted Linear Interpolation
        4.4.3  Improving the Spectral Over-Smoothing Problem
      4.5  Improving Speech Fluency
        4.5.1  Intensity Processing
        4.5.2  Pitch Contour Processing
      4.6  HNM Signal Synthesis
    Chapter 5  Speech Synthesis and Performance Evaluation Experiments
      5.1  Speech Synthesis Experiments
        5.1.1  Formant Enhancement Comparison
        5.1.2  Intensity Processing Comparison
        5.1.3  Pitch Contour Processing Comparison
        5.1.4  Inside-Corpus Sentences
        5.1.5  Outside-Corpus Sentences
      5.2  Performance Evaluation
        5.2.1  Objective Measurement
        5.2.2  Subjective Listening Tests
    Chapter 6  Conclusion
    References
    Appendix A  Corpus
    About the Author

