Graduate Student: 簡延庭 Yen-ting Chien
Thesis Title: HMM Based Singing Voice Synthesis and Timbre Conversion (基於HMM模型之歌聲合成與音色轉換)
Advisor: 古鴻炎 Hung-yan Gu
Oral Defense Committee: 王新民 Hsin-min Wang, 余明興 Ming-shing Yu, 洪西進 Shi-jinn Horng
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication Year: 2013
Graduation Academic Year: 101
Language: Chinese
Number of Pages: 121
Keywords: singing voice synthesis, hidden Markov model, Gaussian mixture model, timbre conversion, discrete cepstrum coefficients
In this thesis, we attempt to combine an HMM spectrum model with a GMM-based timbre conversion model to build a Mandarin singing voice synthesis system that supports singer timbre conversion. For the analysis of spectral coefficients, we use STRAIGHT to obtain accurate spectral envelopes and pitch information, and the spectral envelope of each frame is then converted into discrete cepstrum coefficients (DCC). Next, we develop our own programs to train the HMM spectrum model and the GMM-based timbre conversion model. In the synthesis stage, however, both models suffer from over-smoothed spectral envelopes, so we study a segmental-variance method that adjusts the generated frame-level DCC and alleviates the over-smoothing problem. For singer timbre conversion, we study four methods: the basic timbre conversion method, the third-order GMM conversion method, the relative-amplitude conversion method with GMM, and the relative-amplitude conversion method without GMM. After implementing the system, we conduct listening tests with synthesized singing voice files. The first test shows that an HMM model trained on a singing corpus synthesizes more resonant singing voices than one trained on a speaking corpus. In addition, the listening tests for timbre conversion show that the basic timbre conversion method outperforms the other methods in both timbre similarity and voice quality.
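The segmental-variance adjustment described in the abstract can be illustrated with a minimal sketch. The idea is that HMM/GMM-generated spectral trajectories are over-smoothed, so within a segment each DCC dimension is rescaled so that its variance matches a target variance (e.g., one estimated from natural frames). The function name `variance_scale` and the exact estimation of the target variance are assumptions for illustration, not the thesis's actual implementation.

```python
import numpy as np

def variance_scale(generated, target_std):
    """Rescale generated spectral coefficients for one segment.

    generated  : (frames, dims) matrix of generated DCC for the segment
    target_std : (dims,) per-dimension standard deviations estimated
                 from natural (analysis) frames

    Each dimension is re-expanded about its segment mean so its standard
    deviation matches target_std, countering over-smoothing.
    """
    mean = generated.mean(axis=0)          # per-dimension segment mean
    gen_std = generated.std(axis=0)        # per-dimension generated std
    gen_std = np.where(gen_std == 0, 1.0, gen_std)  # guard against /0
    return mean + (generated - mean) * (target_std / gen_std)
```

Because the transform is a per-dimension affine map about the segment mean, it restores dynamic range without shifting the average spectral shape of the segment.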