Graduate Student: 王讚緯 (Tsan-Wei Wang)
Thesis Title: 使用直方圖等化及目標音框挑選之語音轉換系統 (A Voice Conversion System Using Histogram Equalization and Target Frame Selection)
Advisor: 古鴻炎 (Hung-Yan Gu)
Committee Members: 范欽雄 (Chin-Shyurng Fahn), 王家慶 (Jia-Ching Wang), 王新民 (Hsin-Min Wang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2014
Academic Year of Graduation: 102 (2013-2014)
Language: Chinese
Number of Pages: 70
Keywords: segment-based frame alignment, histogram equalization, linear multivariate regression, target frame selection, voice conversion
In this thesis, linear multivariate regression (LMR) is adopted as the spectrum mapping mechanism, and histogram equalization (HEQ) of spectral coefficients and target frame selection (TFS) are added to the system. We intend to alleviate the spectral over-smoothing problem commonly encountered by conventional Gaussian mixture model (GMM) based spectrum mapping, and thereby improve the converted voice quality. In addition, since parallel training corpora are hard to obtain, we study a method for constructing an imitative parallel corpus from a nonparallel corpus, and then use nonparallel corpora to implement four voice conversion systems: LMR, LMR+TFS, HEQ+LMR, and HEQ+LMR+TFS. In the training stage, segment-based frame alignment is used to construct the imitative parallel corpus, which is then used to train the model parameters of the four systems. For HEQ processing, discrete cepstral coefficients (DCC) are first transformed to principal component analysis (PCA) coefficients, and the PCA coefficients are then transformed to cumulative density function (CDF) coefficients. For TFS, the segment-class number of a frame and its LMR-mapped DCC vector are used to search the set of target-speaker frames collected for the same segment class; the target-speaker DCC vector nearest to the mapped vector is found and used to replace the mapped one. In the testing stage, objective DCC error distance measurements show that adding HEQ reduces the average DCC error distance, whereas adding TFS instead increases it. However, according to the variance ratio (VR) measure and subjective listening tests, TFS does improve the converted voice quality, and an increased average DCC error distance does not imply that the voice quality is worsened.
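The two post-processing steps described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the thesis's actual implementation: the function names, the sorted-sample representation of the source/target CDFs, and the per-segment-class frame dictionary are all assumptions, and the real system operates on PCA-transformed DCC vectors produced by the trained LMR mapping.

```python
import numpy as np

def heq_map(x, src_sorted, tgt_sorted):
    """Histogram-equalize one scalar coefficient: map x through the
    source empirical CDF, then through the inverse target CDF."""
    # Empirical CDF value of x under the source-speaker distribution.
    cdf = np.searchsorted(src_sorted, x) / len(src_sorted)
    # Inverse target CDF: take the target quantile at the same CDF value.
    idx = min(int(cdf * len(tgt_sorted)), len(tgt_sorted) - 1)
    return tgt_sorted[idx]

def target_frame_selection(mapped_dcc, seg_class, class_frames):
    """Replace an LMR-mapped DCC vector with the nearest target-speaker
    DCC vector collected for the same segment class."""
    candidates = class_frames[seg_class]          # shape (N, dim)
    dists = np.linalg.norm(candidates - mapped_dcc, axis=1)
    return candidates[np.argmin(dists)]
```

The nearest-neighbor replacement is what lets TFS trade a larger objective DCC error for output frames that are genuine target-speaker spectra, which is consistent with the abstract's observation that the average DCC error grows while perceived quality improves.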