
Student: 劉子揚 (LIOU ZIH-YANG)
Thesis title: 用於語音合成之聲、韻母時長正規化與預測方法
(Normalization and Prediction of Syllable Initial and Final Durations for Speech Synthesis)
Advisor: 古鴻炎 (Hung-Yan Gu)
Committee members: 王新民 (Hsin-Min Wang), 余明興 (Ming-Shing Yu), 鍾國亮 (Kuo-Liang Chung), 古鴻炎 (Hung-Yan Gu)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2017
Graduation academic year: 105
Language: Chinese
Number of pages: 72
Chinese keywords: 語音合成 (speech synthesis), 時長預測 (duration prediction), 正規化 (normalization)
English keywords: Speech synthesis, Duration prediction, Normalization
  • This thesis studies normalization methods for syllable initial and final
    durations, and designs a feature set for the Weka software to build
    classification and regression trees (CART) that predict the initial and
    final durations of a sentence to be synthesized. By combining the two
    (duration normalization and CART-based duration prediction), we hope to
    improve the naturalness of the synthesized speech in its arrangement of
    initial and final durations. In the training stage, the original initial
    and final durations are obtained from the label files of the training
    sentences; the proposed two-level standard-deviation matching method is
    then used to normalize them, and Weka is used to build separate CARTs for
    initial and for final durations. In the synthesis stage, the CARTs are
    developed into prediction program modules and integrated into a speech
    synthesis system developed by previous researchers, and the integrated
    system synthesizes speech signals using the initial and final durations
    predicted by our method. The synthesized recordings were then used in two
    listening tests: a naturalness comparison and a naturalness MOS
    evaluation. The average scores of the naturalness comparison show that,
    in terms of duration arrangement, our initial- and final-duration
    generation method produces speech closer to the human speaking style than
    the recordings synthesized by the previous method. In the MOS evaluation,
    every one of our synthesized recordings received an average score above
    3.5, and the best one exceeded 4, indicating that most participants
    judged our synthesized speech to be very close to a real human recording.
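The record names the normalization method (two-level standard-deviation matching) but does not give its formula. As a hedged illustration of the general idea behind standard-deviation matching only, the sketch below rescales a set of raw durations so that their mean and standard deviation match chosen target values; the function name and all numbers are hypothetical, not taken from the thesis.

```python
import math

def sd_match(durations, target_mean, target_std):
    """Rescale durations so their sample mean/std match the targets.

    Illustrative one-level sketch only: the thesis's two-level
    standard-deviation matching is not specified in this record.
    """
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    std = math.sqrt(var) or 1.0  # guard against a zero spread
    return [(d - mean) / std * target_std + target_mean for d in durations]

# Example: raw durations (ms) of some syllable finals, mapped to
# corpus-level target statistics (hypothetical numbers).
raw = [120.0, 150.0, 180.0]
norm = sd_match(raw, target_mean=160.0, target_std=40.0)
```

After rescaling, the normalized values have exactly the requested mean and standard deviation, which is what makes durations from different contexts comparable before tree training.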


    In this thesis, normalization methods for syllable initial and
    final durations are studied. Also, a feature set is designed for Weka to
    construct classification and regression trees (CART) to predict the
    syllable initial and final durations of a text sentence to be
    synthesized. We hope to combine the two studies (duration normalization
    and CART-based duration prediction) to increase the naturalness level of
    the synthesized speech, especially in the arrangement of initial and
    final durations. In the training stage, the original durations of
    syllable initials and finals are obtained by reading the corresponding
    label file of each training sentence. Then, the method proposed here,
    two-level standard deviation matching, is used to normalize the durations
    of syllable initials and finals. Next, Weka is used to construct two
    CARTs, one for the durations of syllable initials and one for the finals.
    In the synthesis stage, we develop program modules that predict the
    duration of a syllable initial or final according to the two CARTs
    constructed by Weka. These program modules are integrated into the speech
    synthesis system developed by previous researchers, so the system can
    synthesize speech signals according to the duration normalization and
    prediction methods studied in this thesis. Using the synthesized speech,
    we conduct two types of listening tests: naturalness level comparison and
    naturalness level MOS evaluation. According to the average scores from
    the naturalness comparison test, the duration prediction method studied
    here is indeed better than the method provided by previous researchers,
    because the arrangement of syllable initial and final durations produced
    by our method is more natural. In addition, according to the average
    scores from the MOS evaluation, most participants agree that the speech
    synthesized with our duration prediction method is very close to the
    corresponding speech uttered by a real speaker. In detail, the average
    scores of our synthesized utterances are all greater than 3.5 points, and
    one of them is greater than 4 points. Therefore, the naturalness level of
    the speech synthesized with our duration normalization and prediction
    methods is very close to that of speech uttered by a real person.
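The abstract describes turning Weka-built CARTs into prediction program modules. As an illustrative sketch only (the feature names, thresholds, and leaf values below are invented for illustration, not taken from the thesis's actual trees), a learned regression tree reduces to nested threshold tests over a syllable's feature vector, returning a duration at each leaf:

```python
# Hypothetical sketch of a CART exported as a prediction module:
# a regression tree is nested if/else tests on the feature vector,
# with a predicted duration at each leaf. All features, thresholds,
# and leaf values here are invented for illustration.

def predict_final_duration(features):
    """Return a predicted syllable-final duration in milliseconds."""
    if features["tone"] in (3, 4):               # low / falling tones
        if features["is_phrase_final"]:
            return 210.0                         # leaf value (hypothetical)
        return 150.0
    if features["num_syllables_in_word"] >= 3:   # longer words compress finals
        return 130.0
    return 170.0

# Example feature vector for one syllable (hypothetical attributes).
example = {"tone": 3, "is_phrase_final": True, "num_syllables_in_word": 2}
```

Exporting the tree as plain conditionals is one way such a module could avoid a runtime dependency on Weka inside the synthesis system.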

    Abstract
    Contents
    List of Figures and Tables
    Chapter 1  Introduction
        1.1  Research Motivation
        1.2  Literature Review
            1.2.1  Review of Speech Synthesis Methods
            1.2.2  Prosodic Parameter Generation
            1.2.3  Duration Normalization Methods
            1.2.4  Duration Prediction Methods
        1.3  Research Methods
        1.4  Thesis Organization
    Chapter 2  Preparation of the Training Corpus and Feature Set
        2.1  Corpus Preparation
        2.2  Attributes of the Feature Set
    Chapter 3  Normalization of Syllable and Final Durations
        3.1  Previous Duration Normalization Methods
        3.2  Regression Coefficient Estimation
        3.3  Syllable Duration Normalization
        3.5  Final Duration Normalization: Final Standard-Deviation Matching
        3.6  Final Duration Normalization: Two-Level Standard-Deviation Matching
        3.7  Final Duration Normalization: Concatenated Normalization
        3.8  Experiments on the Normalization Methods
    Chapter 4  Prediction of Final Durations
        4.1  Introduction to the Weka Software
        4.2  The CART Algorithm
        4.3  Steps of Classification and Regression Analysis with Weka
        4.4  Duration Prediction Error Measurements with Weka
            4.4.1  Choice of Weka Algorithm
            4.4.2  Weka Duration Prediction Experiments with Syllable Duration Normalization
            4.4.3  Weka Duration Prediction Experiments with Final Duration Normalization
            4.4.4  Weka Duration Prediction Experiments for Initial Durations
            4.4.5  Comparing TLSDM+Weka(M5P) with Other Duration Prediction Methods
        4.5  Building the Program Modules for Predicting Initial and Final Durations
    Chapter 5  Speech Synthesis System Integration
        5.1  Functions of the Original System
        5.2  Adding the Duration Prediction Module
        5.3  System Interface
        5.4  Duration Prediction Tests
        5.5  Listening Test Experiments
            5.5.1  Naturalness Comparison of Synthesized Speech
            5.5.2  MOS Evaluation of Synthesized Speech Naturalness
    Chapter 6  Conclusion
    References

    [1] M. M. Sondhi and J. Schroeter, "A Hybrid Time-frequency Domain
    Articulatory Speech Synthesizer", IEEE Transactions on Acoustics, Speech,
    and Signal Processing, Vol. ASSP-35, No. 7, July 1987.
    [2] A. J. Hunt and A. W. Black, "Unit Selection in a Concatenative Speech
    Synthesis System Using a Large Speech Database", Int. Conf. on Acoustics,
    Speech, and Signal Processing, Atlanta, USA, 1996.
    [3] 楊仲捷, A Study of VQ/HMM-based Pitch Contour Generation for Mandarin
    Speech Synthesis, Master's thesis, Institute of Electrical Engineering,
    National Taiwan University of Science and Technology, 1999.
    [4] A. Ljolje and F. Fallside, "Synthesis of Natural Sounding Pitch
    Contours in Isolated Utterances Using Hidden Markov Models", IEEE Trans.,
    Vol. 34, No. 5, pp. 1074-1080, 1986.
    [5] 吳伯彥, Chinese Speech Pause Prediction Based on Artificial Neural
    Networks, Master's thesis, Institute of Communications Engineering,
    National Chiao Tung University, 2015.
    [6] 簡敏昌, A Study of VQ/HMM-based Syllable Duration and Amplitude
    Generation, Master's thesis, Department of Electrical Engineering,
    National Taiwan University of Science and Technology, 2000.
    [7] M. Y. Lai and S. F. Tsai, "A Mandarin Speech Synthesis System
    Combining HMM Spectrum Model and ANN Prosody Model", Int. Symposium on
    Chinese Spoken Language Processing (ISCSLP), Tainan, Taiwan, 2010.
    [8] 謝喬華, Mandarin Prosody Modeling Considering Speaking-Rate Effects
    and Its Application to Speech Synthesis, Master's thesis, Institute of
    Communications Engineering, National Chiao Tung University, 2011.
    [9] S. H. Chen, S. H. Hwang, and Y. R. Wang, "An RNN-based Prosodic
    Information Synthesizer for Mandarin Text-to-Speech", IEEE Trans. Speech
    and Audio Processing, Vol. 6, pp. 226-239, 1998.
    [10] A. Lazaridis, P. Zervas, and G. Kokkinakis, "Segmental Duration
    Modeling for Greek Speech Synthesis", IEEE Int. Conf. on Tools with
    Artificial Intelligence, Patras, Greece, 2007.
    [11] S. S. Nikić and I. S. Nikić, "The Development of Phone Duration
    Model in Speech Synthesis in the Serbian Language", Telecommunications
    Forum Telfor (TELFOR), Belgrade, Serbia, 2015.
    [12] Q. Guo, N. Kate, H. Yu, and H. Iwamida, "Decision Tree based
    Duration Prediction in Mandarin TTS System", IEEE Natural Language
    Processing and Knowledge Engineering, Wuhan, China, 2005.
    [13] 姜愷威, Improved Pitch Contour Generation Combining ANN, Global
    Variance, and Real-Contour Selection, Master's thesis, Department of
    Computer Science and Information Engineering, National Taiwan University
    of Science and Technology, 2015.
    [14] 賴名彥, A Mandarin Speech Synthesis System Combining an HMM Spectrum
    Model and an ANN Prosody Model, Master's thesis, Department of Computer
    Science and Information Engineering, National Taiwan University of
    Science and Technology, 2009.
    [15] S. Young, "The HTK Hidden Markov Model Toolkit: Design and
    Philosophy", Tech. Report TR.153, Department of Engineering, Cambridge
    University, UK, 1993.
    [16] K. Sjolander and J. Beskow, Centre for Speech Technology at KTH,
    http://www.speech.kth.se/wavesurfer/
    [17] K. Tokuda et al., "Speech Synthesis Based on Hidden Markov Models",
    Proceedings of the IEEE, Vol. 101, No. 5, pp. 1234-1252, 2013.
    [18] 古鴻炎, 張家維, and 王讚緯, "A Voice Conversion Method Using Linear
    Multivariate Regression to Map Segmented Frames", 24th Conference on
    Computational Linguistics and Speech Processing (ROCLING), Chungli,
    Taiwan, 2012.
    [19] 台灣維基百科, WEKA, http://www.twwiki.com/wiki/WEKA
    [20] Wikipedia (Chinese), Decision Tree Learning,
    https://zh.wikipedia.org/wiki/%E5%86%B3%E7%AD%96%E6%A0%91%E5%AD%A6%E4%B9%A0
    [21] 博客園, http://www.cnblogs.com/church/p/4204935.html
    [22] 陳健勛, Performance Evaluation of Machine Learning Methods for
    Sunspot Number Prediction, Master's thesis, Department of Biomedical
    Informatics, Asia University, 2013.
    [23] 銳之鋒芒, CSDN,
    http://blog.csdn.net/roger__wong/article/details/39453865
    [24] 昨日部落格, http://yester-place.blogspot.tw/2008/07/opencv_26.html
    [25] 吳昌益, A Study of Mandarin Speech Synthesis Using a Spectrum
    Evolution Model, Master's thesis, Department of Computer Science and
    Information Engineering, National Taiwan University of Science and
    Technology, 2007.
    [26] Y. Stylianou, "Harmonic plus Noise Models for Speech, Combined with
    Statistical Methods, for Speech and Speaker Modification", Ph.D. thesis,
    Ecole Nationale Superieure des Telecommunications, Paris, France, 1996.
    [27] Y. Stylianou, "Modeling Speech Based on Harmonic plus Noise Models",
    in Nonlinear Speech Modeling and Applications, eds. G. Chollet et al.,
    Springer-Verlag, Berlin, pp. 244-260, 2005.
    [28] 張世穎, A Mandarin Speech Synthesis System Combining an HTS Spectrum
    Model and an ANN Prosody Model, Master's thesis, Department of Computer
    Science and Information Engineering, National Taiwan University of
    Science and Technology, 2013.
    [29] 蔡松峯, Improvements on GMM-based Voice Conversion, Master's thesis,
    Department of Computer Science and Information Engineering, National
    Taiwan University of Science and Technology, 2009.
    [30] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj, "Analysis of
    Duration Prediction Accuracy in HMM-Based Speech Synthesis", Department
    of Signal Processing, Tampere University of Technology, Tampere,
    Finland, 2010.
    [31] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell,
    D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for
    HTK version 3.2.1), Cambridge University Engineering Department, 2002.
    [32] H. Zen, K. Tokuda, K. Oura, K. Hashimoto, S. Shiota, S. Takaki,
    J. Yamagishi, T. Toda, T. Nose, S. Sako, and A. W. Black, HMM-based
    Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/
    [33] S. Imai, K. Sumita, and C. Furuichi, "Mel Log Spectrum Approximation
    (MLSA) Filter for Speech Synthesis", Transactions of the IECE of Japan,
    J66-A, pp. 122-129, February 1983.
