
Graduate student: 張家維
Chia-wei Chang
Thesis title: 使用主成分向量投影及最小均方對映之語音轉換方法
A Voice Conversion Method Using PCA Vector Projection and LMS Mapping
Advisor: 古鴻炎
Hung-yan Gu
Oral defense committee: 林伯慎
Bor-shen Lin
王新民
Hsin-min Wang
余明興
Min-shin Yu
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 112
Chinese keywords: 主成分分析, 主成分向量投影, 直方圖等化, 最小均方法, 語音轉換
English keywords: principal component analysis, eigenvector projection, histogram equalization, least mean-square, voice conversion

To avoid the spectral over-smoothing caused by the conventional GMM (Gaussian mixture model) based spectral mapping mechanism, we studied and proposed two new voice conversion methods. The first combines histogram equalization (HEQ) with least mean-square (LMS) mapping. The second combines eigenvector projection (EVP) with LMS mapping. A third method, studied as a foil, applies LMS mapping directly to the coefficients obtained from principal component analysis (PCA). Based on these three methods, we built three voice conversion systems: the Baseline system (direct LMS mapping of PCA coefficients), the HEQ system, and the EVP system. All three systems share a set of common processing steps. To relieve the one-to-many mapping problem, speech units are automatically segmented and classified. In the training stage, the speech units collected for each phoneme category are then analyzed separately with PCA, and the discrete cepstrum coefficients (DCC) are converted to PCA coefficients using the resulting principal component vectors. In both objective distance measurements (ODM) and subjective listening tests (SLT), the Baseline system outperforms the conventional GMM-based system. Among the Baseline, HEQ, and EVP systems, the EVP system obtains the smallest average distance in ODM. In the SLT for timbre similarity, the converted voices of the three systems are comparably good. In the SLT for voice quality, the EVP system performs best.
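
The Baseline pipeline described in the abstract (per-category PCA, then a linear mapping of source PCA coefficients to target PCA coefficients) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation. It assumes that the LMS mapping is an ordinary least-squares affine fit, that source and target frames of one phoneme category have already been paired frame by frame (for example by DTW), and that random vectors stand in for DCC frames. All function names (`fit_pca`, `to_pca`, `from_pca`, `fit_linear_map`, `convert`) are hypothetical.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Return (mean, components) for frames of shape (n_frames, dim)."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # eigh returns eigenvalues in ascending order, so re-sort descending.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T  # components: (n_components, dim)

def to_pca(frames, mean, components):
    """Project feature frames onto the principal component vectors."""
    return (frames - mean) @ components.T

def from_pca(coeffs, mean, components):
    """Reconstruct feature frames from PCA coefficients."""
    return coeffs @ components + mean

def fit_linear_map(src_coeffs, tgt_coeffs):
    """Least-squares fit of an affine map: tgt ~= [src, 1] @ W (assumed form)."""
    X = np.hstack([src_coeffs, np.ones((len(src_coeffs), 1))])
    W, *_ = np.linalg.lstsq(X, tgt_coeffs, rcond=None)
    return W

def convert(src_frames, src_pca, tgt_pca, W):
    """Source frames -> source PCA coeffs -> mapped coeffs -> target frames."""
    coeffs = to_pca(src_frames, *src_pca)
    X = np.hstack([coeffs, np.ones((len(coeffs), 1))])
    return from_pca(X @ W, *tgt_pca)

# Toy data for one phoneme category: the target is an exact affine
# function of the source, so the fitted mapping should reproduce it.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 4))               # stand-in for source DCC frames
tgt = src @ rng.normal(size=(4, 4)) + 0.5     # stand-in for aligned target frames

src_pca = fit_pca(src, 4)
tgt_pca = fit_pca(tgt, 4)
W = fit_linear_map(to_pca(src, *src_pca), to_pca(tgt, *tgt_pca))
converted = convert(src, src_pca, tgt_pca, W)
print(np.abs(converted - tgt).max())  # near zero for this exactly linear toy case
```

In a setup like the one the abstract describes, one such PCA basis and mapping matrix would be trained per phoneme category, and real speech would not be exactly linear, so the residual would be nonzero.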

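Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures and Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Literature Review
    1.2.1 Spectral Feature Parameters
    1.2.2 Principal Component Analysis
    1.2.3 Histogram Equalization
    1.2.4 Timbre Conversion Methods
  1.3 Research Methods
    1.3.1 Baseline Voice Conversion System
    1.3.2 HEQ Voice Conversion System
    1.3.3 EVP (Eigenvector Projection) Voice Conversion System
  1.4 Thesis Organization
Chapter 2 Corpus Preparation and Preprocessing
  2.1 Corpus Recording
  2.2 Speech Classification
    2.2.1 Phoneme Classification and Initial/Final Classification
    2.2.2 HMM-State-Based Classification
    2.2.3 Labeling and Segmentation
  2.3 DTW Frame Alignment
Chapter 3 DCC Coefficient Computation and Principal Component Analysis
  3.1 DCC Coefficient Computation
  3.2 Introduction to Principal Component Analysis
  3.3 Computing Principal Component Vectors of Speech Signals
  3.4 Conversion to Principal Component Coefficients
  3.5 DCC Coefficient Reconstruction
  3.6 Dimensionality Reduction Experiments
Chapter 4 Least Mean-Square (LMS) Method and Histogram Equalization (HEQ)
  4.1 Introduction to LMS
  4.2 Training the LMS Parameter Matrix
  4.3 LMS Conversion
  4.4 LMS Correction
  4.5 Introduction to HEQ
  4.6 HEQ Analysis
    4.6.1 HEQ Tables
    4.6.2 CDF Coefficient Conversion
    4.6.3 PCA Coefficient Reconstruction
Chapter 5 Eigenvector Projection (EVP)
  5.1 Introduction to EVP
  5.2 EVP Conversion
  5.3 EVP-diagonal
  5.4 EVP-threshold
  5.5 Combining EVP with LMS
Chapter 6 Voice Conversion Experiments
  6.1 Training and Test Corpora
  6.2 Distance Measures
  6.3 Experiment 1: Speech Classification Methods
  6.4 Experiment 2: Comparison with the Conventional GMM Voice Conversion Method
  6.5 Experiment 3: Comparison of the Three Systems in This Thesis
  6.6 Experiment 4: Cross-Gender versus Same-Gender Conversion
Chapter 7 System Implementation and Listening Tests
  7.1 HNM Speech Signal Synthesis
  7.2 Pitch Adjustment
  7.3 System Interface
  7.4 Program Implementation
  7.5 Subjective Listening Experiments
    7.5.1 Timbre Similarity Test
    7.5.2 Voice Quality Comparison
    7.5.3 Correlation between Objective Distance and Timbre and Voice Quality
Chapter 8 Conclusion
References
Appendix (1) Correspondence between Zhuyin Symbols and Pinyin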

