Author: 賴名彥 (Ming-Yen Lai)
Thesis Title: 結合HMM頻譜模型與ANN韻律模型之國語語音合成系統 (A Mandarin Speech Synthesis System Combining HMM Spectrum Model and ANN Prosody Model)
Advisor: 古鴻炎 (Hung-Yan Gu)
Committee: 余明興 (Ming-Shing Yu), 王新民 (Hsin-Min Wang), 林伯慎 (Bor-Shen Lin)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2009
Graduation Academic Year: 98
Language: Chinese
Pages: 87
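Keywords (in Chinese): 文字轉語音系統 (text-to-speech system), 隱藏式馬可夫模型 (hidden Markov model), 頻譜演進 (spectral progression), 類神經網路 (artificial neural network)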
Keywords (in other languages): TTS, HMM, ANN

This thesis proposes a Mandarin speech synthesis system that combines an HMM (hidden Markov model) spectrum model with an ANN (artificial neural network) prosody model. In the training phase, DCC (discrete cepstrum coefficients) are computed for each frame of the training corpus and used as spectral feature parameters. The multiple utterances of the same syllable are then grouped into a few clusters according to their spectral progression paths, as matched by DTW (dynamic time warping), and one HMM is trained for each cluster. In addition, the HMM number and the contextual information of each syllable utterance are recorded. In the synthesis phase, an HMM is first selected for each syllable of the input sentence according to that syllable's contextual information. For the selected HMM, we study a method for locating the boundary between its unvoiced and voiced states. The duration ANN and the mean durations of the HMM states are then used to decide how many frames each HMM state should generate. Besides the MLE (maximum likelihood estimate) method proposed by previous researchers, we study three interpolation methods for generating the DCC coefficients of each frame, so that synthesis can run in real time. Next, the spectral envelope derived from the DCC coefficients and the pitch contour generated by another ANN are used to control an HNM (harmonic-plus-noise model) based signal synthesizer. The results of the listening tests show that speech synthesized with the proposed weighted linear interpolation method is significantly more natural than speech synthesized with the MLE method. Likewise, speech synthesized with durations from the duration ANN is significantly more natural than speech synthesized with the mean durations of the HMM states.
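The frame-allocation and interpolation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `allocate_frames` and `interpolate_dcc` are hypothetical, the allocation rule (frames split in proportion to state mean durations) is an assumption, and the interpolation shown is plain linear interpolation between state mean vectors, not the weighted variant the thesis proposes, whose weights are not given in this record.

```python
import numpy as np

def allocate_frames(total_frames, state_mean_durations):
    """Split a syllable's frame count among HMM states.

    Assumption: frames are shared in proportion to each state's mean
    duration; total_frames would come from a duration model such as an ANN.
    """
    means = np.asarray(state_mean_durations, dtype=float)
    ideal = total_frames * means / means.sum()
    frames = np.floor(ideal).astype(int)
    # Hand out the frames lost to rounding, largest fractional part first.
    remainder = total_frames - frames.sum()
    order = np.argsort(-(ideal - frames), kind="stable")
    frames[order[:remainder]] += 1
    return frames

def interpolate_dcc(state_means, frames_per_state):
    """Generate one DCC vector per frame by plain linear interpolation.

    Each state's mean vector is placed at the centre of that state's
    frame span; frames before the first centre or after the last one
    hold the nearest state mean. Assumes every state has >= 1 frame.
    """
    state_means = np.asarray(state_means, dtype=float)
    frames_per_state = np.asarray(frames_per_state)
    ends = np.cumsum(frames_per_state)
    centres = ends - frames_per_state / 2.0
    t = np.arange(ends[-1]) + 0.5  # centre position of every output frame
    columns = [np.interp(t, centres, state_means[:, d])
               for d in range(state_means.shape[1])]
    return np.stack(columns, axis=1)

if __name__ == "__main__":
    # Hypothetical 3-state syllable with 2-dimensional "DCC" means.
    frames = allocate_frames(20, [2.0, 5.0, 3.0])
    dcc = interpolate_dcc([[0.0, 1.0], [1.0, 0.5], [2.0, 0.0]], frames)
    print(frames, dcc.shape)
```

Interpolation of this kind needs only a few arithmetic operations per frame, which is consistent with the abstract's stated motivation of real-time synthesis; the MLE method, by contrast, solves for the whole parameter trajectory at once.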

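Chinese Abstract
Abstract
Acknowledgments
Table of Contents
List of Figures and Tables
Chapter 1 Introduction
  1.1 Research Motivation and Objectives
  1.2 Literature Review
    1.2.1 System Architecture
    1.2.2 Spectral Progression
    1.2.3 Review of Speech Signal Synthesis Methods
  1.3 Research Method
  1.4 Thesis Organization
Chapter 2 Corpus Preparation
  2.1 Corpus Preprocessing
    2.1.1 Recording and Corpus
    2.1.2 Phonetic Labeling, Forced Alignment, and Segmentation
  2.2 Spectral Parameter Extraction
    2.2.1 Spectral Envelope Estimation Scheme
    2.2.2 Discrete Cepstrum
    2.2.3 Spectral Parameter Computation
Chapter 3 Speech Model Training
  3.1 Clustering Based on Spectral Progression Paths
    3.1.1 Purpose of Clustering
    3.1.2 Clustering Method for Syllable Utterances
    3.1.3 Cluster Validity Evaluation
    3.1.4 Clustering Implementation
  3.2 Hidden Markov Model
  3.3 HMM Training
    3.3.1 Initial Model
    3.3.2 Segmental K-means Method
    3.3.3 Viterbi Search
  3.4 Training of State Duration Parameters
  3.5 Duration ANN Model Training
    3.5.1 Neural Network Structure
Chapter 4 DCC Coefficient Generation Methods
  4.1 HMM Selection
  4.2 HMM State Dwell Length
  4.3 Frame-Level DCC Coefficient Generation
    4.3.1 Maximum Likelihood Method
    4.3.2 Linear Interpolation
    4.3.3 Weighted Linear Interpolation
    4.3.4 Parabolic Interpolation
    4.3.5 Approximation Error Measurement
Chapter 5 Building the Speech Synthesis System
  5.1 Voiced/Unvoiced Decision for HMM States
  5.2 Prosodic Parameter Generation
    5.2.1 Volume
    5.2.2 Duration
    5.2.3 Pitch Contour
  5.3 HNM Signal Generation
  5.4 Program Interface
Chapter 6 Speech Synthesis Experiments and Listening Tests
  6.1 Speech Synthesis Experiments
    6.1.1 Inside-Corpus Sentences
    6.1.2 Outside-Corpus Sentences
    6.1.3 Comparison of Real and Synthesized Spectra
    6.1.4 System Execution Speed
  6.2 Listening Test Evaluation Method
  6.3 Listening Test Results
    6.3.1 Test Item: Two DCC Coefficient Generation Methods
    6.3.2 Test Item: Two Corpus Sizes
    6.3.3 Test Item: Two Syllable Duration Generation Methods
    6.3.4 Test Item: Different Speech Synthesis Systems
Chapter 7 Conclusion
References
About the Author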
