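Graduate student: | 張家維 Chia-wei Chang |
---|---|
Thesis title: | 使用主成分向量投影及最小均方對映之語音轉換方法 A Voice Conversion Method Using PCA Vector Projection and LMS Mapping |
Advisor: | 古鴻炎 Hung-yan Gu |
Oral defense committee: | 林伯慎 Bor-shen Lin, 王新民 Hsin-min Wang, 余明興 Min-shin Yu |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | Chinese |
Number of pages: | 112 |
Keywords (Chinese): | 主成分分析 (principal component analysis), 主成分向量投影 (principal component vector projection), 直方圖等化 (histogram equalization), 最小均方法 (least mean-square method), 語音轉換 (voice conversion) |
Keywords (English): | principal component analysis, eigenvector projection, histogram equalization, least mean-square, voice conversion |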
To avoid the spectral over-smoothing caused by the conventional Gaussian mixture model (GMM) based spectral mapping mechanism, we studied and proposed two new voice conversion methods. The first combines histogram equalization (HEQ) with least mean-square (LMS) mapping. The second combines eigenvector projection (EVP), that is, projection onto principal component vectors, with LMS mapping. A third method, studied as a foil, applies LMS mapping directly to the coefficients obtained from principal component analysis (PCA). Based on these three methods, we built three voice conversion systems: the Baseline system (direct LMS mapping of PCA coefficients), the HEQ system, and the EVP system. The three systems share a set of common processing steps: speech units are automatically segmented and classified in order to relieve the one-to-many mapping problem. In the training stage, PCA is therefore performed separately on the speech units collected for each phoneme category, and the discrete cepstrum coefficients (DCC) are then converted to PCA coefficients using the resulting principal component vectors. Compared with the conventional GMM-based system, the Baseline system performs better in both objective distance measurements (ODM) and subjective listening tests (SLT). Among the Baseline, HEQ, and EVP systems, the EVP system obtains the smallest average distance in ODM. In the SLT for timbre similarity, all three systems perform well and their converted voices are comparable; in the SLT for voice quality, the EVP system performs best.
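The abstract names histogram equalization and an LMS mapping but gives no formulas, so the following is only a minimal sketch of the two ideas on a single coefficient dimension, not the thesis's implementation. The function names `make_heq` and `fit_least_squares` are hypothetical, the HEQ step is a plain empirical-CDF (quantile) match, and a closed-form least-squares line fit stands in for the LMS mapping. The PCA projection step and the way the thesis chains these steps are not shown.

```python
from bisect import bisect_right


def make_heq(source_samples, target_samples):
    """Build a histogram-equalization map for one coefficient dimension.

    Assumption: HEQ is done by empirical CDF matching, so a source value
    is sent to the target sample that sits at the same cumulative rank.
    """
    src = sorted(source_samples)
    tgt = sorted(target_samples)

    def equalize(x):
        # Number of source samples <= x gives the empirical CDF rank of x.
        count = bisect_right(src, x)
        # Index of the target sample at the same rank (integer ceiling),
        # clamped so values below the source minimum map to the target minimum.
        idx = (count * len(tgt) + len(src) - 1) // len(src) - 1
        return tgt[max(idx, 0)]

    return equalize


def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by minimizing the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data for illustration only (not taken from the thesis).
equalize = make_heq([1, 2, 3, 4], [10, 20, 30, 40])
slope, intercept = fit_least_squares([0, 1, 2], [1, 3, 5])
```

In a real system each function would be trained per phoneme category and per coefficient dimension on time-aligned source and target frames; that alignment is assumed here and not shown.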