
Student: 劉子揚 (LIOU ZIH-YANG)
Thesis title: 用於語音合成之聲、韻母時長正規化與預測方法
(Normalization and Prediction of Syllable Initial and Final Durations for Speech Synthesis)
Advisor: 古鴻炎 (Hung-Yan Gu)
Committee members: 王新民 (Hsin-Min Wang), 余明興 (Ming-Shing Yu), 鍾國亮 (Kuo-Liang Chung), 古鴻炎 (Hung-Yan Gu)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2017
Graduation academic year: 105
Language: Chinese
Number of pages: 72
Chinese keywords: 語音合成 (speech synthesis), 時長預測 (duration prediction), 正規化 (normalization)
English keywords: Speech synthesis, Duration prediction, Normalization
  • This thesis studies normalization methods for syllable initial and final
    durations, and designs a feature set for the Weka software to build
    classification and regression trees (CART) that predict the initial and
    final durations of a sentence to be synthesized. By combining the two
    (duration normalization and CART-based duration prediction), we hope to
    improve the naturalness of the synthesized speech in its arrangement of
    initial and final durations. In the training stage, the original initial
    and final durations are obtained from the label files of the training
    sentences; the proposed two-level standard-deviation matching method is
    then used to normalize them, and Weka is used to build separate CARTs for
    initial and for final durations. In the synthesis stage, the CARTs are
    developed into prediction program modules and integrated into a speech
    synthesis system developed by previous researchers, and the integrated
    system synthesizes speech signals using the initial and final durations
    predicted by our method. The synthesized recordings were then used in two
    listening tests: a naturalness comparison and a naturalness MOS
    evaluation. The average scores of the naturalness comparison show that,
    in terms of duration arrangement, our initial- and final-duration
    generation method produces speech closer to the human speaking style than
    the recordings synthesized by the previous method. In the MOS evaluation,
    every one of our synthesized recordings received an average score above
    3.5, and the best one exceeded 4, indicating that most participants
    judged our synthesized speech to be very close to a real human recording.
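The record names the normalization method (two-level standard-deviation matching) but does not give its formula. As a hedged illustration of the general idea behind standard-deviation matching only, the sketch below rescales a set of raw durations so that their mean and standard deviation match chosen target values; the function name and all numbers are hypothetical, not taken from the thesis.

```python
import math

def sd_match(durations, target_mean, target_std):
    """Rescale durations so their sample mean/std match the targets.

    Illustrative one-level sketch only: the thesis's two-level
    standard-deviation matching is not specified in this record.
    """
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    std = math.sqrt(var) or 1.0  # guard against a zero spread
    return [(d - mean) / std * target_std + target_mean for d in durations]

# Example: raw durations (ms) of some syllable finals, mapped to
# corpus-level target statistics (hypothetical numbers).
raw = [120.0, 150.0, 180.0]
norm = sd_match(raw, target_mean=160.0, target_std=40.0)
```

After rescaling, the normalized values have exactly the requested mean and standard deviation, which is what makes durations from different contexts comparable before tree training.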


    In this thesis, normalization methods for syllable initial and
    final durations are studied. Also, a feature set is designed for Weka to
    construct classification and regression trees (CART) to predict the
    syllable initial and final durations of a text sentence to be
    synthesized. We hope to combine the two studies (duration normalization
    and CART-based duration prediction) to increase the naturalness level of
    the synthesized speech, especially in the arrangement of initial and
    final durations. In the training stage, the original durations of
    syllable initials and finals are obtained by reading the corresponding
    label file of each training sentence. Then, the method proposed here,
    two-level standard deviation matching, is used to normalize the durations
    of syllable initials and finals. Next, Weka is used to construct two
    CARTs, one for the durations of syllable initials and one for the finals.
    In the synthesis stage, we develop program modules that predict the
    duration of a syllable initial or final according to the two CARTs
    constructed by Weka. These program modules are integrated into the speech
    synthesis system developed by previous researchers, so the system can
    synthesize speech signals according to the duration normalization and
    prediction methods studied in this thesis. Using the synthesized speech,
    we conduct two types of listening tests: naturalness level comparison and
    naturalness level MOS evaluation. According to the average scores from
    the naturalness comparison test, the duration prediction method studied
    here is indeed better than the method provided by previous researchers,
    because the arrangement of syllable initial and final durations produced
    by our method is more natural. In addition, according to the average
    scores from the MOS evaluation, most participants agree that the speech
    synthesized with our duration prediction method is very close to the
    corresponding speech uttered by a real speaker. In detail, the average
    scores of our synthesized utterances are all greater than 3.5 points, and
    one of them is greater than 4 points. Therefore, the naturalness level of
    the speech synthesized with our duration normalization and prediction
    methods is very close to that of speech uttered by a real person.
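The abstract describes turning Weka-built CARTs into prediction program modules. As an illustrative sketch only (the feature names, thresholds, and leaf values below are invented for illustration, not taken from the thesis's actual trees), a learned regression tree reduces to nested threshold tests over a syllable's feature vector, returning a duration at each leaf:

```python
# Hypothetical sketch of a CART exported as a prediction module:
# a regression tree is nested if/else tests on the feature vector,
# with a predicted duration at each leaf. All features, thresholds,
# and leaf values here are invented for illustration.

def predict_final_duration(features):
    """Return a predicted syllable-final duration in milliseconds."""
    if features["tone"] in (3, 4):               # low / falling tones
        if features["is_phrase_final"]:
            return 210.0                         # leaf value (hypothetical)
        return 150.0
    if features["num_syllables_in_word"] >= 3:   # longer words compress finals
        return 130.0
    return 170.0

# Example feature vector for one syllable (hypothetical attributes).
example = {"tone": 3, "is_phrase_final": True, "num_syllables_in_word": 2}
```

Exporting the tree as plain conditionals is one way such a module could avoid a runtime dependency on Weka inside the synthesis system.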

    Abstract
    Contents
    List of Figures and Tables
    Chapter 1  Introduction
        1.1  Research Motivation
        1.2  Literature Review
            1.2.1  Review of Speech Synthesis Methods
            1.2.2  Prosodic Parameter Generation
            1.2.3  Duration Normalization Methods
            1.2.4  Duration Prediction Methods
        1.3  Research Methods
        1.4  Thesis Organization
    Chapter 2  Preparation of the Training Corpus and Feature Set
        2.1  Corpus Preparation
        2.2  Attributes of the Feature Set
    Chapter 3  Normalization of Syllable and Final Durations
        3.1  Previous Duration Normalization Methods
        3.2  Regression Coefficient Estimation
        3.3  Syllable Duration Normalization
        3.5  Final Duration Normalization: Final Standard-Deviation Matching
        3.6  Final Duration Normalization: Two-Level Standard-Deviation Matching
        3.7  Final Duration Normalization: Concatenated Normalization
        3.8  Experiments on the Normalization Methods
    Chapter 4  Prediction of Final Durations
        4.1  Introduction to the Weka Software
        4.2  The CART Algorithm
        4.3  Steps of Classification and Regression Analysis with Weka
        4.4  Duration Prediction Error Measurements with Weka
            4.4.1  Choice of Weka Algorithm
            4.4.2  Weka Duration Prediction Experiments with Syllable Duration Normalization
            4.4.3  Weka Duration Prediction Experiments with Final Duration Normalization
            4.4.4  Weka Duration Prediction Experiments for Initial Durations
            4.4.5  Comparing TLSDM+Weka(M5P) with Other Duration Prediction Methods
        4.5  Building the Program Modules for Predicting Initial and Final Durations
    Chapter 5  Speech Synthesis System Integration
        5.1  Functions of the Original System
        5.2  Adding the Duration Prediction Module
        5.3  System Interface
        5.4  Duration Prediction Tests
        5.5  Listening Test Experiments
            5.5.1  Naturalness Comparison of Synthesized Speech
            5.5.2  MOS Evaluation of Synthesized Speech Naturalness
    Chapter 6  Conclusion
    References

    [1] M. M. Sondhi and J. Schroeter, "A Hybrid Time-frequency Domain
    Articulatory Speech Synthesizer", IEEE Transactions on Acoustics, Speech,
    and Signal Processing, Vol. ASSP-35, No. 7, July 1987.
    [2] A. J. Hunt and A. W. Black, "Unit Selection in a Concatenative Speech
    Synthesis System Using a Large Speech Database", Int. Conf. on Acoustics,
    Speech, and Signal Processing, Atlanta, USA, 1996.
    [3] 楊仲捷, A Study of VQ/HMM-based Pitch Contour Generation for Mandarin
    Speech Synthesis, Master's thesis, Institute of Electrical Engineering,
    National Taiwan University of Science and Technology, 1999.
    [4] A. Ljolje and F. Fallside, "Synthesis of Natural Sounding Pitch
    Contours in Isolated Utterances Using Hidden Markov Models", IEEE Trans.,
    Vol. 34, No. 5, pp. 1074-1080, 1986.
    [5] 吳伯彥, Chinese Speech Pause Prediction Based on Artificial Neural
    Networks, Master's thesis, Institute of Communications Engineering,
    National Chiao Tung University, 2015.
    [6] 簡敏昌, A Study of VQ/HMM-based Syllable Duration and Amplitude
    Generation, Master's thesis, Department of Electrical Engineering,
    National Taiwan University of Science and Technology, 2000.
    [7] M. Y. Lai and S. F. Tsai, "A Mandarin Speech Synthesis System
    Combining HMM Spectrum Model and ANN Prosody Model", Int. Symposium on
    Chinese Spoken Language Processing (ISCSLP), Tainan, Taiwan, 2010.
    [8] 謝喬華, Mandarin Prosody Modeling Considering Speaking-Rate Effects
    and Its Application to Speech Synthesis, Master's thesis, Institute of
    Communications Engineering, National Chiao Tung University, 2011.
    [9] S. H. Chen, S. H. Hwang, and Y. R. Wang, "An RNN-based Prosodic
    Information Synthesizer for Mandarin Text-to-Speech", IEEE Trans. Speech
    and Audio Processing, Vol. 6, pp. 226-239, 1998.
    [10] A. Lazaridis, P. Zervas, and G. Kokkinakis, "Segmental Duration
    Modeling for Greek Speech Synthesis", IEEE Int. Conf. on Tools with
    Artificial Intelligence, Patras, Greece, 2007.
    [11] S. S. Nikić and I. S. Nikić, "The Development of Phone Duration
    Model in Speech Synthesis in the Serbian Language", Telecommunications
    Forum Telfor (TELFOR), Belgrade, Serbia, 2015.
    [12] Q. Guo, N. Kate, H. Yu, and H. Iwamida, "Decision Tree based
    Duration Prediction in Mandarin TTS System", IEEE Natural Language
    Processing and Knowledge Engineering, Wuhan, China, 2005.
    [13] 姜愷威, Improved Pitch Contour Generation Combining ANN, Global
    Variance, and Real-Contour Selection, Master's thesis, Department of
    Computer Science and Information Engineering, National Taiwan University
    of Science and Technology, 2015.
    [14] 賴名彥, A Mandarin Speech Synthesis System Combining an HMM Spectrum
    Model and an ANN Prosody Model, Master's thesis, Department of Computer
    Science and Information Engineering, National Taiwan University of
    Science and Technology, 2009.
    [15] S. Young, "The HTK Hidden Markov Model Toolkit: Design and
    Philosophy", Tech. Report TR.153, Department of Engineering, Cambridge
    University, UK, 1993.
    [16] K. Sjolander and J. Beskow, Centre for Speech Technology at KTH,
    http://www.speech.kth.se/wavesurfer/
    [17] K. Tokuda et al., "Speech Synthesis Based on Hidden Markov Models",
    Proceedings of the IEEE, Vol. 101, No. 5, pp. 1234-1252, 2013.
    [18] 古鴻炎, 張家維, and 王讚緯, "A Voice Conversion Method Using Linear
    Multivariate Regression to Map Segmented Frames", 24th Conference on
    Computational Linguistics and Speech Processing (ROCLING), Chungli,
    Taiwan, 2012.
    [19] 台灣維基百科, WEKA, http://www.twwiki.com/wiki/WEKA
    [20] Wikipedia (Chinese), Decision Tree Learning,
    https://zh.wikipedia.org/wiki/%E5%86%B3%E7%AD%96%E6%A0%91%E5%AD%A6%E4%B9%A0
    [21] 博客園, http://www.cnblogs.com/church/p/4204935.html
    [22] 陳健勛, Performance Evaluation of Machine Learning Methods for
    Sunspot Number Prediction, Master's thesis, Department of Biomedical
    Informatics, Asia University, 2013.
    [23] 銳之鋒芒, CSDN,
    http://blog.csdn.net/roger__wong/article/details/39453865
    [24] 昨日部落格, http://yester-place.blogspot.tw/2008/07/opencv_26.html
    [25] 吳昌益, A Study of Mandarin Speech Synthesis Using a Spectrum
    Evolution Model, Master's thesis, Department of Computer Science and
    Information Engineering, National Taiwan University of Science and
    Technology, 2007.
    [26] Y. Stylianou, "Harmonic plus Noise Models for Speech, Combined with
    Statistical Methods, for Speech and Speaker Modification", Ph.D. thesis,
    Ecole Nationale Superieure des Telecommunications, Paris, France, 1996.
    [27] Y. Stylianou, "Modeling Speech Based on Harmonic plus Noise Models",
    in Nonlinear Speech Modeling and Applications, eds. G. Chollet et al.,
    Springer-Verlag, Berlin, pp. 244-260, 2005.
    [28] 張世穎, A Mandarin Speech Synthesis System Combining an HTS Spectrum
    Model and an ANN Prosody Model, Master's thesis, Department of Computer
    Science and Information Engineering, National Taiwan University of
    Science and Technology, 2013.
    [29] 蔡松峯, Improvements on GMM-based Voice Conversion, Master's thesis,
    Department of Computer Science and Information Engineering, National
    Taiwan University of Science and Technology, 2009.
    [30] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Analysis of
    Duration Prediction Accuracy in HMM-Based Speech Synthesis", Department
    of Signal Processing, Tampere University of Technology, Tampere,
    Finland, 2010.
    [31] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell,
    D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for
    HTK version 3.2.1), Cambridge University Engineering Department, 2002.
    [32] H. Zen, K. Tokuda, K. Oura, K. Hashimoto, S. Shiota, S. Takaki,
    J. Yamagishi, T. Toda, T. Nose, S. Sako, and A. W. Black, HMM-based
    Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/
    [33] S. Imai, K. Sumita, and C. Furuichi, "Mel Log Spectrum Approximation
    (MLSA) Filter for Speech Synthesis", Transactions of the IECE of Japan,
    J66-A, pp. 122-129, February 1983.
