Graduate student: | Tzu-Chieh Lan (藍子杰) |
---|---|
Thesis title: | Singing Voice Synthesis Using Demi-syllable Unit Selection and Phase Vocoder Based Signal Model (使用半音節單元挑選與相位聲碼器信號模型之歌聲合成) |
Advisor: | Hung-Yan Gu (古鴻炎) |
Committee members: | Hsin-Min Wang (王新民), Kuo-Liang Chung (鍾國亮), Bor-Shen Lin (林柏慎) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
Year of publication: | 2017 |
Academic year of graduation: | 105 (ROC calendar) |
Language: | Chinese |
Number of pages: | 116 |
Chinese keywords: | 歌聲合成、單元挑選、相位聲碼器 |
English keywords: | Singing Voice Synthesis, Unit Selection, Phase Vocoder |
Besides improving the naturalness of synthetic singing voices through unit selection, this thesis focuses on preserving the signal quality and timbre of the synthesized singing voice; that is, we want the timbre of the synthetic songs to be as close as possible to that of the original singer. To this end, we study signal analysis and synthesis methods based on the phase vocoder (PV). To keep the timbre consistent when a voice unit is pitch-shifted, we use a spectral envelope derived from discrete cepstral coefficients (DCC) to adjust the PV parameters of all channels jointly. In addition, the DCC and PV parameters obtained with the analysis methods of earlier researchers are not entirely correct, so we study smoothing methods to correct them.

In the synthesis stage, we reuse a dynamic-programming-based unit selection module developed by an earlier researcher, but revise several flawed cost functions and weight values so that the most suitable sequence of singing-voice units is selected. The DCC-derived spectral envelopes and ANN-generated fundamental frequencies are then used to adjust the PV parameters, and the resulting sequence of PV parameter frames is fed to an additive sinusoidal model to generate the signal samples of each syllable.

Using the synthesized singing voices, we conducted three listening tests: timbre similarity, signal quality, and naturalness. According to the average timbre-similarity scores, the PV-based signal model is superior to the harmonic-plus-noise model (HNM). For signal quality, the average scores are inconsistent across songs and the standard deviations are large, so it is hard to judge whether the PV or the HNM signal model is better. For naturalness, the average scores of our synthetic songs all reach 4 points or higher, which indicates that the proposed singing voice synthesis method, combining demi-syllable unit selection with a PV-based signal model, achieves a good level of naturalness.
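The idea of keeping timbre fixed while changing pitch, and then generating samples with an additive sinusoidal model, can be illustrated with a small Python sketch. This is not the thesis implementation: the piecewise-linear envelope is only a stand-in for a DCC-derived spectral envelope, the harmonics are static instead of being tracked frame by frame as PV parameters, and the function names (`envelope_at`, `harmonics_from_envelope`, `additive_synth`) are hypothetical.

```python
import math


def envelope_at(env_freqs, env_amps, f):
    """Linearly interpolate a piecewise-linear spectral envelope at frequency f (Hz).

    Assumption: the envelope is given as sorted breakpoints; a DCC-derived
    envelope would be evaluated from cepstral coefficients instead.
    """
    if f <= env_freqs[0]:
        return env_amps[0]
    if f >= env_freqs[-1]:
        return env_amps[-1]
    for i in range(1, len(env_freqs)):
        if f <= env_freqs[i]:
            t = (f - env_freqs[i - 1]) / (env_freqs[i] - env_freqs[i - 1])
            return env_amps[i - 1] + t * (env_amps[i] - env_amps[i - 1])


def harmonics_from_envelope(env_freqs, env_amps, new_f0, n_harmonics):
    """Place harmonics at multiples of new_f0 and read their amplitudes
    from the unchanged envelope, so the spectral shape (timbre) is kept."""
    return [
        (k * new_f0, envelope_at(env_freqs, env_amps, k * new_f0))
        for k in range(1, n_harmonics + 1)
    ]


def additive_synth(partials, n_samples, sample_rate):
    """Sum sinusoids for the given (frequency, amplitude) pairs.

    Partials at or above the Nyquist frequency are skipped to avoid aliasing.
    """
    usable = [(f, a) for f, a in partials if f < sample_rate / 2]
    return [
        sum(a * math.sin(2 * math.pi * f * i / sample_rate) for f, a in usable)
        for i in range(n_samples)
    ]


# Example: the same envelope rendered at two different pitches.
env_freqs = [0.0, 1000.0, 2000.0]
env_amps = [1.0, 0.5, 0.0]
low = harmonics_from_envelope(env_freqs, env_amps, 200.0, 5)
high = harmonics_from_envelope(env_freqs, env_amps, 250.0, 5)
signal = additive_synth(high, 800, 8000)
```

In this sketch, raising the fundamental from 200 Hz to 250 Hz moves the harmonic frequencies, while each harmonic's amplitude is re-read from the same envelope rather than being carried along with the harmonic, which is the property the abstract refers to as timbre consistency.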