| Field | Value |
|---|---|
| Graduate Student | 姜愷威 (Kei-wei Chiang) |
| Thesis Title | Improved Pitch-contour Generation Methods Combining ANN, Global Variance and Real-contour Selection (結合ANN、全域變異數與真實軌跡挑選之基週軌跡產生之改進方法) |
| Advisor | 古鴻炎 (Hung-Yan Gu) |
| Committee Members | 王新民 (Hsin-Min Wang), 余明興 (Ming-Shing Yu), 范欽雄 (Chin-Shyurng Fahn) |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2015 |
| Academic Year of Graduation | 103 (ROC calendar) |
| Language | Chinese |
| Pages | 87 |
| Keywords (Chinese) | real-contour selection, global variance, artificial neural network, discrete cosine transform coefficients, pitch contour, variance ratio |
| Keywords (English) | variance ratio, real-contour selection |
This thesis proposes a syllable pitch-contour generation method that combines an artificial neural network (ANN), global-variance (GV) adjustment, and real pitch-contour selection. The method alleviates the over-smoothing of pitch contours generated by an ANN and improves the naturalness of the intonation of synthetic speech. In the model-training stage, to deal with pitch-detection errors, we analyze the types of errors and then correct the erroneous pitch values with a program; afterwards, each syllable's pitch contour is transformed into DCT coefficients, which are used to train the ANN model and the GV parameters, and the DCT coefficient vectors of the syllables are also classified and stored. In the generation stage, with a sentence's contextual data as input, the ANN first predicts the DCT coefficients that represent the pitch contour; next, the DCT coefficients of each dimension are adjusted according to the GV parameters in order to relieve the over-smoothing mentioned above. In addition, to further improve intonation naturalness, the GV-adjusted DCT vector is used to select a contour from the pre-classified stored real pitch contours, which is output as the final syllable pitch contour. For objective evaluation of the proposed method, we measured the variance ratio (VR) under several option settings; in general, the larger the amplification coefficient used in GV adjustment, the higher the resulting VR. Moreover, the results of subjective listening tests show that GV adjustment with an appropriate amplification coefficient indeed improves intonation naturalness, and that adding the real-contour selection step further improves the intonation naturalness of the synthetic speech.
In this thesis, we propose an improved syllable pitch-contour generation method that combines an ANN (artificial neural network), global variance, and real-contour selection. This method not only alleviates the over-smoothing of pitch contours generated by an ANN but also improves the naturalness of the synthetic pitch contour. In the training stage, the automatically detected pitch contours are checked manually for several types of errors and then corrected by a program developed in this work. Next, each syllable pitch contour is transformed into DCT (discrete cosine transform) coefficients. These DCT coefficients are used to train the ANN model and the GV (global variance) parameters, and are also saved separately according to several context classification modes. In the generation stage, the ANN first predicts the DCT coefficients of each syllable pitch contour from the input contextual information items. The generated DCT coefficients are then adjusted by GV matching in each DCT vector dimension in order to alleviate the over-smoothing phenomenon mentioned above. Moreover, to further improve the naturalness of the synthetic pitch contours, we use the DCT vector generated by the ANN and adjusted by GV matching to select a real pitch contour from the saved contour pool corresponding to the requested context class. For objective assessment of the proposed method, we measure the VR (variance ratio) under different option settings. We found that higher VR values are obtained when a larger weight for GV adjustment is used. In addition, the results of subjective listening tests demonstrate that an appropriate weight value for GV adjustment improves the naturalness of the generated pitch contours, and that the real-contour selection step further improves the naturalness.
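The generation pipeline described in the abstract (DCT representation of a syllable pitch contour, GV adjustment of the predicted coefficients, the VR measure, and nearest-neighbor selection of a real contour) can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual implementation: all function names, the number of DCT coefficients, and the Euclidean selection criterion are assumptions made here for clarity.

```python
import numpy as np

def contour_to_dct(f0, n_coef=4):
    """Represent a syllable pitch contour (e.g. log-F0 samples) by its
    first few orthonormal DCT-II coefficients."""
    x = np.asarray(f0, dtype=float)
    n = len(x)
    k = np.arange(n)
    coefs = np.array([np.dot(x, np.cos(np.pi * (2 * k + 1) * m / (2 * n)))
                      for m in range(n_coef)]) * np.sqrt(2.0 / n)
    coefs[0] /= np.sqrt(2.0)
    return coefs

def gv_adjust(pred, gv_target, weight=1.0):
    """Rescale each DCT dimension of the ANN-predicted vectors so that its
    variance across the sentence matches the global variance learned from
    training data; `weight` plays the role of the amplification coefficient."""
    pred = np.asarray(pred, dtype=float)          # shape: (n_syllables, n_coef)
    mean, var = pred.mean(axis=0), pred.var(axis=0)
    safe_var = np.where(var > 0, var, 1.0)        # avoid division by zero
    scale = np.where(var > 0, (gv_target / safe_var) ** (0.5 * weight), 1.0)
    return mean + (pred - mean) * scale

def variance_ratio(generated, natural_gv):
    """Objective VR measure: variance of the generated coefficients relative
    to the natural global variance, averaged over dimensions."""
    return float(np.mean(np.asarray(generated).var(axis=0) / natural_gv))

def select_real_contour(adjusted_vec, pool):
    """Pick, from the stored real-contour DCT vectors of the matching
    context class, the vector nearest to the GV-adjusted prediction."""
    pool = np.asarray(pool, dtype=float)
    dist = np.linalg.norm(pool - np.asarray(adjusted_vec), axis=1)
    return pool[np.argmin(dist)]
```

With `weight=1.0` the per-dimension variance of the adjusted coefficients equals the target GV exactly (VR = 1); a larger weight amplifies the variance further, which matches the abstract's observation that larger amplification coefficients yield higher VR values.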