
Graduate Student: 林祐靖 (Yu-ching Lin)
Thesis Title: 結合HMM頻譜模型與ANN抖音模型之國語歌聲合成
(Mandarin Singing Voice Synthesis Combining HMM Spectrum Models and ANN Vibrato Models)
Advisor: 古鴻炎 (Hung-yan Gu)
Committee Members: 余明興 (Ming-sing Yu), 廖元甫 (Yuan-fu Liao), 鍾國亮 (Guo-liang Zhong)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2014
Graduation Academic Year: 102 (2013-2014)
Language: Chinese
Number of Pages: 77
Chinese Keywords: hidden Markov model, Mandarin singing voice synthesis, artificial neural network, vibrato
Foreign Keywords: Mandarin, Singing Voice Synthesis, Vibrato, HTS
Access Counts: Views: 245, Downloads: 3
  • This thesis combines the spectrum HMM models of HTS with ANN vibrato models to build a Mandarin singing voice synthesis system of noticeably improved naturalness. In the training stage, STRAIGHT is used to analyze the fundamental frequency of each frame; question sets are then designed so that the HTS software can train pitch HMMs, spectrum HMM models, and decision trees. In the synthesis stage, the HTS engine first synthesizes an initial singing voice; the ANN vibrato models then use the duration information to generate a vibrato pitch contour for each syllable, which replaces the pitch contour originally produced by HTS, so that HTS synthesizes a singing voice with both correct pitch and vibrato. In addition, we take the fullness setting of each note into account and adjust the amplitude of the synthesized voice to eliminate harsh sounds. For the synthesized singing voice, we measured the error of the MGC spectral coefficients and conducted subjective listening tests; the results show that the synthesized voice with the replaced pitch contour is much more natural than the voice originally synthesized by HTS.
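The note-fullness amplitude adjustment described in the abstract can be sketched as follows. This is a minimal illustration under assumed details: the function name `apply_fullness`, the linear fade ramps, and the fade length are hypothetical stand-ins, not the thesis's actual implementation.

```python
def apply_fullness(samples, fullness, fade_len=64):
    """Keep only the first `fullness` fraction of a note's samples and
    apply short linear fade-in/fade-out ramps, so that truncating the
    note does not leave an abrupt, harsh-sounding edge."""
    # Never keep fewer samples than the two fade ramps need.
    n_keep = max(2 * fade_len, int(len(samples) * fullness))
    out = list(samples[:n_keep])
    for i in range(fade_len):
        gain = i / float(fade_len)       # linear ramp from 0 to 1
        out[i] *= gain                   # fade-in at the note onset
        out[n_keep - 1 - i] *= gain      # fade-out at the truncated end
    return out
```

For example, `apply_fullness([1.0] * 1000, 0.8)` keeps 800 samples of the note and smoothly silences both edges instead of cutting the waveform abruptly.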


    In this thesis, a Mandarin singing voice synthesis system is constructed by combining HTS (HMM-based speech synthesis system) trained HMM spectrum models and ANN (artificial neural network) vibrato models. The system is intended to raise the naturalness level of the synthesized singing voice. In the training stage, STRAIGHT is used to analyze the fundamental frequency of each signal frame. Then, question sets are designed for the HTS software to train HMM fundamental-frequency models, HMM spectrum models, and decision trees. In the synthesis stage, the HTS engine is first run to synthesize an initial singing voice. The HMM state-staying durations are then fed to the ANN vibrato models to generate a natural pitch contour for each lyric syllable. Next, the pitch contour generated by HTS is replaced, so that HTS is forced to synthesize a new singing voice that not only follows the correct melody but also carries vibrato characteristics. In addition, we consider the occupying rate of each note's duration and adjust the amplitude of the synthetic singing voice to reduce harsh noise. As for the quality of the synthetic singing voice, the average spectral error in terms of MGC (mel-generalized cepstrum) coefficients is measured, and listening tests are conducted. The results show that the synthetic singing voice with pitch replacement is more natural than the singing voice originally synthesized by HTS.
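The pitch-replacement step lends itself to a short sketch. In the sketch below, a fixed-parameter sinusoid stands in for the vibrato rate and extent that the ANN models would actually predict from syllable context; all function names and parameter values are illustrative assumptions.

```python
import math

def vibrato_contour(note_hz, n_frames, rate_hz=5.5, extent_cents=60.0,
                    frame_shift=0.005):
    """Per-frame F0 values (Hz) for one syllable: the note's base pitch
    modulated by a sinusoidal vibrato whose depth is given in cents."""
    contour = []
    for i in range(n_frames):
        t = i * frame_shift
        cents = extent_cents * math.sin(2.0 * math.pi * rate_hz * t)
        contour.append(note_hz * 2.0 ** (cents / 1200.0))
    return contour

def replace_pitch(hts_f0, syllable_spans, note_freqs):
    """Overwrite the HTS-generated F0 track syllable by syllable with
    vibrato contours; the frame spans would come from the HMM
    state-staying durations."""
    f0 = list(hts_f0)
    for (start, end), note_hz in zip(syllable_spans, note_freqs):
        f0[start:end] = vibrato_contour(note_hz, end - start)
    return f0
```

The modified contour would then be handed back to the HTS engine so that the vocoder resynthesizes the waveform with the new, vibrato-bearing pitch track.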

    Abstract (Chinese)
    ABSTRACT
    Contents
    List of Figures and Tables
    Chapter 1 Introduction
      1.1 Research Motivation and Purpose
      1.2 Literature Review
        1.2.1 HMM (hidden Markov model) Based Approaches
        1.2.2 Non-HMM Approaches
        1.2.3 Singing Expression Factors
      1.3 Research Methods
      1.4 Thesis Organization
    Chapter 2 Corpus Preparation and Labeling
      2.1 Corpus Recording
      2.2 Phonetic Labeling and Segmentation
      2.3 STRAIGHT Analysis
      2.4 Average Pitch Computation
    Chapter 3 Construction of HMM Spectrum Models
      3.1 The HTS Singing Voice Synthesis Software
      3.2 Introduction to Hidden Markov Models
      3.3 HTS Parameter Settings and Input File Preparation
        3.3.1 Mandarin Singing Unit Definition
        3.3.2 HTS Parameter File Settings
        3.3.3 HTS Input Files
      3.4 Feature Coefficient Extraction
      3.5 Context-Independent HMM Models
      3.6 Coarse Context-Dependent HMM Models
      3.7 HMM Tree-Based Clustering and Decision Trees
        3.7.1 Question Sets
        3.7.2 Decision Trees
      3.8 HMM Training and Second Clustering
      3.9 Context-Dependent Label File Format
    Chapter 4 Vibrato Pitch-Contour Generation
      4.1 HTS Pitch Contours
      4.2 Vibrato Parameter Analysis
      4.3 The Artificial Neural Network Model
        4.3.1 Network Structure
        4.3.2 Input Parameters
        4.3.3 Output Parameters
      4.4 Pitch-Contour Generation
    Chapter 5 Singing Signal Synthesis
      5.1 Synthesis Flow
      5.2 Musical Parameters
        5.2.1 Score Format and Conversion
        5.2.2 Initial-Consonant Duration Ratio
        5.2.3 Fullness Processing
        5.2.4 Pitch-Transition Processing
        5.2.5 Singing Amplitude Adjustment
      5.3 Modifications to the HTS engine Software
    Chapter 6 Performance Evaluation
      6.1 Effect of GV Parameters
      6.2 Objective Measurements
      6.3 Naturalness Listening Test (MOS)
      6.4 Naturalness Listening Test (AB)
    Chapter 7 Conclusion
    References
    Appendix A Modifications to HTS_pstream.c
      A.1 Modifications Related to length.txt
      A.2 Modifications Related to reall.txt

    [1] B. Boashash, “Estimating and interpreting the instantaneous frequency of a signal, Part I: Fundamentals”, Proceedings of the IEEE, Vol. 80, pp. 519-538, April 1992.
    [2] B. Boashash, “Estimating and interpreting the instantaneous frequency of a signal. Part 2: Algorithms and applications”, Proceedings of the IEEE, Vol. 80, pp. 539-568, April 1992.
    [3] C. Dodge, and T. A. Jerse, Computer Music: Synthesis, composition, and performance, second ed., Schirmer Books, 1997.
    [4] E. Prame, “Vibrato extent and intonation in professional western lyric singing”, J. Acoust. Soc. Am., Vol. 102, pp. 616-621, 1997.
    [5] E. Prame, “Measurements of the vibrato rate of ten singers”, J. Acoust. Soc. Am., Vol. 96, pp. 1979-1984, 1994.
    [6] E. Prame, “Vibrato extent and intonation in professional western lyric singing”, J. Acoust. Soc. Am., Vol. 102, pp. 616-621, 1997.
    [7] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigne’, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction,” Speech Communication 27, pp. 187–207, 1999.
    [8] H. Zen, K. Tokuda, K. Oura, K. Hashimoto, S. Shiota, S. Takaki, J. Yamagishi, T. Toda, T. Nose, S. Sako, Alan W. Black, HMM-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/
    [9] HTK, “Forced Alignment”, https://netfiles.uiuc.edu/tyoon/www/ForcedAlignment.htm.
    [10] I. Arroabarren, et al., “Measurement of vibrato in lyric singers”, IEEE Instrumentation and Measurement Technology Conference, pp. 1529-1534, 2001.
    [11] J. C. Brown, and K. V. Vaughn, “Pitch center of stringed instrument vibrato tones”, J. Acoust. Soc. Am. 100, 1728–1735, 1996.
    [12] J. Bonada, X. Serra, “Synthesis of the singing voice by performance sampling and spectral models,” IEEE Signal Processing Magazine March 2007.
    [13] J. Schoukens, R. Pintelon, and H. Van Hamme,“The interpolated fast fourier transform: A comparative study”, IEEE trans. Instrum. Meas., Vol. 41, pp. 226-232, April 1992.
    [14] J. Sundberg, “Effects of the vibrato and the ‘singing formant’ on pitch”, Musica Slovaca VI, 1978, Bratislava, 51–69; also J. Res. Singing 5(2), 5–17. 1978.
    [15] J. Sundberg, E. Prame, and J. Iwarsson, “Replicability and Accuracy of Pitch Patterns in Professional Singers“, in Vocal Fold Physiology, edited by P. J. Davis and N. H. Fletcher (Singular, San Diego), 1996.
    [16] K. Tokuda, H. Zen, and A. W. Black, “An HMM-based speech synthesis system applied to English,” Proc. IEEE 2002 Workshop on Speech Synthesis, Santa Monica, USA, Sep. 2002.
    [17] K. Kato, et al., “Blending vocal music with the sound field - the effective duration of autocorrelation function of western professional singing voices with different vowels and pitches”, International Symposium on Musical Acoustics (ISMA2004), Nara, Japan, 2004.
    [18] K. Saino, H. Zen, Y. Nankaku, A. Lee, K. Tokuda, “An HMM-based Singing Voice Synthesis System,” INTERSPEECH 2006 – ICSLP.
    [19] K. Sjolander and J. Beskow, Centre for Speech Technology at KTH, http://www.speech.kth.se/wavesurfer/.
    [20] MATLAB, http://www.mathworks.com/products/matlab/.
    [21] P. Howes, et al., “The relationship between measured vibrato characteristics and perception in Western operatic singing”, Journal of Voice, Vol. 18, pp. 216-230, 1997.
    [22] Park, Younsung, Sungrack Yun, and Chang Dong Yoo. "Parametric emotional singing voice synthesis." Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010.
    [23] Rabiner, Lawrence, and B. Juang. "An introduction to hidden Markov models." ASSP Magazine, IEEE 3.1 (1986): 4-16.
    [24] S. Imaizumi, H. Saida, Y. Shimura and H. Hirose, “Harmonic analysis of the singing voice:—Acoustic characteristics of vibrato“, in Proceedings of the Stockholm Music Acoustics Conference (SMAC93) Royal Swedish Academy of Music, Stockholm, pp. 197–200. 1994.
    [25] S. W. Lee, Shen Ting Ang, Minghui Dong, and Haizhou Li, "Generalized F0 modelling with absolute and relative pitch features for singing voice synthesis", Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.
    [26] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.2.1), Cambridge University Engineering Department, 2002.
    [27] Shinoda, Koichi, and Takao Watanabe. "MDL-based context-dependent subword modeling for speech recognition." Journal of the Acoustical Society of Japan (E) 21.2 (2000): 79-86.
    [28] Shinoda, Koichi, and Takao Watanabe. "Speaker adaptation with autonomous model complexity control by MDL principle." Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on. Vol. 2. IEEE, 1996.
    [29] Shonle, J. I., and Horan, K. E. “The pitch of vibrato tones“, J. Acoust. Soc. Am. 67, 246–252. 1980.
    [30] Sinsy, “HMM-based Singing Voice Synthesis System,” http://www.sinsy.jp/.
    [31] SPTK Working Group, Speech Signal Processing Toolkit (SPTK), http://sp-tk.sourceforge.net/
    [32] T. Toda and K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis." IEICE TRANSACTIONS on Information and Systems 90.5 (2007): 816-824.
    [33] K. Tokuda, et al. "Mel-generalized cepstral analysis: a unified approach to speech spectral estimation." ICSLP. Vol. 94. 1994.
    [34] Y. Horii, “Acoustic analysis of vocal vibrato: a theoretical interpretation of data“, J. Voice 3, 36–43. 1989.
    [35] Yamaha, VOCALOID, New Singing Synthesis Technology, http://www.vocaloid.com/en/.
    [36] 王如江, Mandarin Singing Voice Synthesis Based on Singing Expression Analysis and Unit Selection (基於歌聲表情分析與單元選擇之國語歌聲合成研究), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
    [37] 林正甫, Mandarin Singing Voice Synthesis Using ANN Vibrato Parameter Models (使用ANN抖音參數模型之國語歌聲合成), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2008.
    [38] 陳安璿, A Singing Voice Synthesis System Integrating MIDI Accompaniment (整合MIDI伴奏之歌唱聲合成系統), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, 2004.
    [39] A Retrospective of Campus Folk Songs (校園民歌回顧), 一品文化出版, Taipei, 1985.
    [40] 張世穎, A Mandarin Speech Synthesis System Combining HTS Spectrum Models and ANN Prosody Models (結合HTS頻譜模型與ANN韻律模型之國語語音合成系統), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, 2013.
    [41] 葉怡成, Applications and Implementation of Artificial Neural Network Models (類神經網路模式應用與實作), 儒林圖書公司, 2006.
    [42] 華堃, A Study on Improving the Synthesis of Singing Voice and Musical Instrument Sounds (歌唱聲以及樂器聲合成改進之研究), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, 2011.
    [43] 廖皇量, A Study on Signal Quality Improvement for Mandarin Singing Voice Synthesis (國語歌聲合成信號品質改進之研究), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2006.
    [44] 簡延庭, HMM-Based Singing Voice Synthesis and Timbre Conversion (基於HMM模型之歌聲合成與音色轉換), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2013.
