
Graduate student: Kei-wei Chiang (姜愷威)
Thesis title: Improved Pitch-contour Generation Methods Combining ANN, Global Variance and Real-contour Selection (結合ANN、全域變異數與真實軌跡挑選之基週軌跡產生之改進方法)
Advisor: Hung-Yan Gu (古鴻炎)
Committee members: Hsin-Min Wang (王新民), Ming-Shing Yu (余明興), Chin-Shyurng Fahn (范欽雄)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2015
Graduation academic year: 103 (ROC calendar)
Language: Chinese
Number of pages: 87
Chinese keywords: real-contour selection, global variance, artificial neural network, DCT coefficients, pitch contour, variance ratio
Foreign keywords: variance ratio, real-contour selection
Access counts: 258 views, 2 downloads
  • This thesis proposes a syllable pitch-contour generation method that combines an artificial neural network (ANN), global variance (GV) adjustment, and real pitch-contour selection, in order to alleviate the over-smoothing of ANN-generated pitch contours and to raise the naturalness of the synthesized speech's intonation. In the model-training stage, to cope with pitch-detection errors, we analyze the error types and correct the erroneous pitch values with a program; each syllable's pitch contour is then converted into DCT coefficients, which are used to train the ANN model and the GV parameters, and the syllables' DCT coefficient vectors are also classified and stored. In the generation stage, a sentence's contextual data are taken as input, and the ANN first predicts the DCT coefficients that represent the pitch contours; the coefficients of each dimension are then adjusted according to the GV parameters to relieve the over-smoothing mentioned above. Furthermore, to raise intonation naturalness further, the GV-adjusted DCT vector is used to select, from the pre-classified stored real pitch contours, the contour that is output as the final syllable pitch contour. For objective evaluation, we measured the variance ratio (VR) under several option settings; in general, the larger the amplification coefficient used in GV adjustment, the higher the resulting VR. Subjective listening tests show that GV adjustment with an appropriate amplification coefficient indeed improves intonation naturalness, and that adding the real-contour selection step raises the intonation naturalness of the synthesized speech further.
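In the training stage described above, each syllable's pitch contour is compressed into a few DCT coefficients. The sketch below illustrates that conversion; the orthonormal DCT-II form, the function names, and the choice of keeping four coefficients are illustrative assumptions, not details taken from the thesis:

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II of a (log-F0) pitch contour, from the definition."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(n)[:, None]            # coefficient index
    t = np.arange(n)[None, :]            # sample index
    c = np.sqrt(2.0 / n) * (np.cos(np.pi * (2 * t + 1) * k / (2 * n)) @ x)
    c[0] /= np.sqrt(2.0)                 # DC term gets the extra 1/sqrt(2)
    return c

def contour_to_dct(log_f0, n_coef=4):
    """Keep only the first few coefficients: a compact, smooth description."""
    return dct2(log_f0)[:n_coef]

def dct_to_contour(coefs, n_points):
    """Reconstruct an n_points-sample contour; missing coefficients are zero."""
    full = np.zeros(n_points)
    full[:len(coefs)] = coefs
    k = np.arange(n_points)[None, :]
    t = np.arange(n_points)[:, None]
    w = np.full(n_points, np.sqrt(2.0 / n_points))
    w[0] = np.sqrt(1.0 / n_points)       # matching DC weight of the forward DCT
    return np.cos(np.pi * (2 * t + 1) * k / (2 * n_points)) @ (w * full)
```

Truncating to the leading coefficients keeps the overall level and slow shape of the contour while discarding frame-level jitter, which is what makes the coefficients convenient training targets for the ANN.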


    In this thesis, we propose an improved syllable pitch-contour generation method that combines an ANN (artificial neural network), GV (global variance) adjustment, and real-contour selection. The method not only alleviates the over-smoothing of pitch contours generated by an ANN but also improves the naturalness of the synthetic pitch contours. In the training stage, the automatically detected pitch contours are checked for several types of detection errors and then corrected by a program developed in this work. Next, each syllable pitch contour is transformed into DCT (discrete cosine transform) coefficients. These DCT coefficients are used to train the ANN model and the GV parameters, and are also saved separately according to several context classes. In the generation stage, the ANN first predicts the DCT coefficients of each syllable pitch contour from the input contextual information. The predicted DCT coefficients are then adjusted by GV matching in each vector dimension to alleviate the over-smoothing mentioned above. Moreover, to raise the naturalness of the synthetic pitch contours further, the GV-adjusted DCT vector is used to select a real pitch contour from the saved contour pool of the requested context class. For objective assessment, we measured the VR (variance ratio) under different option settings; in general, a larger weight for GV adjustment yields a higher VR. In addition, subjective listening tests demonstrate that an appropriate weight for GV adjustment improves the naturalness of the generated pitch contours, and that the real-contour selection step improves the naturalness further.
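The generation-stage operations (GV matching per DCT dimension, selection from the stored real contours, and the VR measure) can be sketched as below. This is a simplified reading of the abstract under stated assumptions: that GV matching rescales each dimension's variance across a sentence's syllables toward the trained global variance times a weight, and that selection uses Euclidean distance; all function names are mine, not the thesis's.

```python
import numpy as np

def gv_adjust(pred, gv_var, weight=1.0):
    """Rescale each DCT dimension of the ANN predictions (one row per
    syllable) so its variance becomes weight * gv_var; means are kept."""
    pred = np.asarray(pred, dtype=float)
    mean = pred.mean(axis=0)
    var = pred.var(axis=0)
    scale = np.sqrt(weight * np.asarray(gv_var) / np.maximum(var, 1e-12))
    return mean + (pred - mean) * scale

def select_real_contour(adjusted_vec, pool):
    """Return the stored real-contour DCT vector (from one context class)
    closest to the GV-adjusted prediction in Euclidean distance."""
    pool = np.asarray(pool, dtype=float)
    dists = np.linalg.norm(pool - adjusted_vec, axis=1)
    return pool[int(np.argmin(dists))]

def variance_ratio(generated, natural):
    """Per-dimension variance of generated DCT vectors over that of natural
    ones, averaged over dimensions: a VR-style objective measure."""
    g = np.asarray(generated, dtype=float).var(axis=0)
    n = np.asarray(natural, dtype=float).var(axis=0)
    return float(np.mean(g / n))
```

With weight = 1 the adjusted coefficients match the trained global variance exactly, so the VR moves toward 1; larger weights push the VR higher, which matches the trend the abstract reports.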

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Contents
    List of Figures and Tables
    Chapter 1  Introduction
      1.1 Research Motivation
      1.2 Literature Review
        1.2.1 Review of Speech Synthesis Methods
        1.2.2 Prosodic Parameter Generation
        1.2.3 Pitch-contour Generation
      1.3 Research Method
      1.4 Organization of the Thesis
    Chapter 2  Training Corpus and Pitch-contour Extraction
      2.1 Corpus Preparation
      2.2 Pitch-contour Extraction with SPTK
      2.3 Remedying Pitch-detection Errors
      2.4 Discrete Cosine Transform
    Chapter 3  Pitch-contour Neural Network
      3.1 Introduction to Neural Networks
      3.2 Structure of the Pitch-contour ANN
      3.3 Neural-network Parameters
        3.3.1 Input and Output Parameters
        3.3.2 Normalization of DCT Coefficients
      3.4 Experiments on the Number of Nodes
    Chapter 4  GV Adjustment and Real Pitch-contour Selection
      4.1 Other Attempts before Global-variance Adjustment
        4.1.1 Standard-deviation Adjustment by Syllable-final Class
        4.1.2 Standard-deviation Adjustment by Tone-combination Class
      4.2 Introduction to Global Variance
      4.3 Training of Global-variance Parameters
      4.4 GV Adjustment in the Synthesis Stage
      4.5 Real Pitch-contour Selection
        4.5.1 Real-segment Conditions
        4.5.2 Candidate-syllable Selection
    Chapter 5  Objective Measurements and Subjective Listening Tests
      5.1 Objective Measurements
        5.1.1 Average DCT-distance Difference
        5.1.2 Variance Ratio
      5.2 Subjective Listening Tests
        5.2.1 Listening Tests with Male Synthetic Speech
        5.2.2 Listening Tests with Female Synthetic Speech
        5.2.3 Voting Comparison of Naturalness among Generation Methods
    Chapter 6  Conclusion
    References

