
Author: Tsan-Wei Wang (王讚緯)
Title: A Voice Conversion System Using Histogram Equalization and Target Frame Selection (使用直方圖等化及目標音框挑選之語音轉換系統)
Advisor: Hung-Yan Gu (古鴻炎)
Committee members: Chin-Shyurng Fahn (范欽雄), Jia-Ching Wang (王家慶), Hsin-Min Wang (王新民)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar, i.e., 2013-2014)
Language: Chinese
Number of pages: 70
Keywords: segment-based frame alignment, histogram equalization, linear multivariate regression, target frame selection, voice conversion

Abstract:
In this thesis, linear multivariate regression (LMR) is adopted as the spectrum mapping mechanism, and histogram equalization (HEQ) of spectral coefficients and target frame selection (TFS) are added to the system. These additions are intended to alleviate the spectral over-smoothing problem commonly encountered by conventional GMM (Gaussian mixture model) based spectrum mapping, and thereby to improve the quality of the converted voice. In addition, because parallel training corpora are difficult to obtain, we study a method for constructing an imitative parallel corpus from a nonparallel corpus, and then use nonparallel corpora to implement four voice conversion systems: LMR, LMR+TFS, HEQ+LMR, and HEQ+LMR+TFS. In the training stage, segment-based frame alignment is used to construct the imitative parallel corpus, which is then used to train the model parameters of each of the four systems. For histogram equalization, discrete cepstral coefficients (DCC) are first transformed into principal component analysis (PCA) coefficients, which are in turn transformed into cumulative density function (CDF) coefficients. For target frame selection, the segment-class number of a frame and the DCC vector mapped by LMR are used to search the set of target-speaker frames collected for the same segment class; the target-speaker DCC vector nearest to the mapped vector is then found and used to replace it. In the testing stage, objective measurements show that adding histogram equalization decreases the average DCC error distance, whereas adding target frame selection increases it. However, according to the variance ratio (VR) and the subjective listening tests, target frame selection does improve the converted voice quality; a larger average DCC error distance therefore does not imply worse voice quality.
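The segment-based frame alignment used in the training stage pairs frames inside matched source and target segments, which is commonly done with dynamic time warping (DTW) over a spectral distance. As a hedged illustration only (the thesis's full procedure in Chapter 3 also involves target segment selection and a dynamic programming step over whole utterances), here is a minimal Python/NumPy sketch of DTW frame pairing between two segments:

    import numpy as np

    def dtw_align(src, tgt):
        """Pair the DCC frames of a source segment with those of a target
        segment; the returned (i, j) index pairs serve as imitative
        parallel training pairs. Euclidean frame distance is assumed."""
        n, m = len(src), len(tgt)
        cost = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(
                    acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
        # Backtrack from (n, m) to recover the optimal alignment path.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]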
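The LMR mapping itself amounts to a linear least-squares fit from aligned source DCC vectors to target DCC vectors. A minimal sketch under stated assumptions: src_dcc and tgt_dcc are hypothetical aligned arrays of shape (n_frames, dim), and the appended bias column is this example's choice, not necessarily the thesis's exact formulation:

    import numpy as np

    def train_lmr(src_dcc, tgt_dcc):
        """Fit y ~= W @ [x; 1] by least squares; returns W of shape
        (dim, dim + 1), whose last column acts as a bias term."""
        n = src_dcc.shape[0]
        X = np.hstack([src_dcc, np.ones((n, 1))])  # append bias column
        W_T, *_ = np.linalg.lstsq(X, tgt_dcc, rcond=None)
        return W_T.T

    def lmr_map(W, x):
        """Map one source DCC vector into the target speaker's space."""
        return W @ np.append(x, 1.0)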
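For the HEQ step, one common realization of the DCC-to-PCA-to-CDF pipeline is per-dimension equalization through empirical CDF lookup tables. The sketch below is illustrative, not the thesis's implementation; the table resolution n_bins and all function names are invented for this example:

    import numpy as np

    def pca_fit(dcc):
        """Mean and principal axes of the training DCC vectors."""
        mean = dcc.mean(axis=0)
        _, _, Vt = np.linalg.svd(dcc - mean, full_matrices=False)
        return mean, Vt

    def to_pca(dcc, mean, Vt):
        """Project DCC vectors onto the principal axes (decorrelation)."""
        return (dcc - mean) @ Vt.T

    def heq_table(pca_coeffs, n_bins=256):
        """Per-dimension table of empirical quantiles on a CDF grid."""
        q = np.linspace(0.0, 1.0, n_bins)
        return q, np.quantile(pca_coeffs, q, axis=0)   # (n_bins, dim)

    def equalize(x, q, src_table, tgt_table):
        """Source PCA coefficient -> CDF coefficient -> target value."""
        y = np.empty_like(x, dtype=float)
        for d in range(x.shape[-1]):
            cdf = np.interp(x[d], src_table[:, d], q)  # CDF coefficient
            y[d] = np.interp(cdf, q, tgt_table[:, d])  # inverse target CDF
        return y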
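Target frame selection then replaces each LMR-mapped DCC vector with the closest real target-speaker frame of the same segment class, which keeps the converted spectra on the manifold of genuine target frames. A minimal sketch, where frames_by_class is a hypothetical dictionary from segment-class number to that class's collected target DCC vectors:

    import numpy as np

    def select_target_frame(mapped_dcc, seg_class, frames_by_class):
        """Return the target-speaker DCC vector of the given segment class
        nearest (Euclidean distance assumed) to the LMR-mapped vector."""
        candidates = frames_by_class[seg_class]    # shape (n_frames, dim)
        dists = np.linalg.norm(candidates - mapped_dcc, axis=1)
        return candidates[np.argmin(dists)]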
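Finally, the two objective measures cited above can be sketched as follows; the exact definitions in Chapter 6 of the thesis may differ (for example in dimension weighting), so treat these as assumptions:

    import numpy as np

    def avg_dcc_error(converted, target):
        """Mean Euclidean distance between aligned converted/target frames."""
        return float(np.mean(np.linalg.norm(converted - target, axis=1)))

    def variance_ratio(converted, target):
        """Per-dimension variance of converted frames relative to target
        frames, averaged; values near 1 indicate less over-smoothing."""
        return float(np.mean(converted.var(axis=0) / target.var(axis=0)))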

Table of contents:
Front matter: Abstract (Chinese), Abstract (English), Acknowledgments, Table of Contents, List of Figures and Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Literature Review
    1.2.1 Spectral Feature Parameters
    1.2.2 Voice Conversion Methods
    1.2.3 Prosody Conversion Methods
  1.3 Research Methods
    1.3.1 The LMR+TFS System
    1.3.2 The HEQ+LMR+TFS System
  1.4 Thesis Organization
Chapter 2 Corpus Preparation and Spectral Feature Extraction
  2.1 Phonetic Labeling and Segmentation
  2.2 Phonetic Labeling and Segmentation
  2.3 Segment Classification and Context Information
    2.3.1 Classification of Initials and Finals
    2.3.2 Context Information
Chapter 3 Segment-Based Frame Alignment
  3.1 Dynamic Time Warping of Segments
  3.2 Target Segment Selection
  3.3 Dynamic Programming Algorithm
  3.4 Frame Collection
Chapter 4 Principal Component Analysis and Histogram Equalization
  4.1 Overview of Principal Component Analysis
  4.2 Principal Component Analysis
    4.2.1 Computing the Principal Component Vectors
    4.2.2 PCA Coefficient Transform and Inverse Transform
  4.3 Overview of Histogram Equalization
  4.4 Histogram Equalization Analysis
    4.4.1 HEQ Tables
    4.4.2 CDF Coefficient Transform and Inverse Transform
Chapter 5 Linear Multivariate Regression and Target Frame Selection
  5.1 Overview of LMR
  5.2 Training the LMR Mapping Matrix
  5.3 LMR Mapping
  5.4 Target Frame Selection
Chapter 6 Voice Conversion Experiments
  6.1 Speaker Pairing and Corpus Allocation
  6.2 Average DCC Error and Variance Ratio Measurements
  6.3 Experiments with the LMR System
  6.4 Experiments with the HEQ+LMR System
  6.5 Experiments with the LMR+TFS System
  6.6 Experiments with the HEQ+LMR+TFS System
  6.7 Discussion of Experimental Results
Chapter 7 System Implementation and Listening Tests
  7.1 HNM Speech Signal Synthesis
  7.2 Pitch Conversion
  7.3 System Interface
  7.4 Program Implementation
  7.5 Voice Quality Listening Tests
Chapter 8 Conclusion
References

References:
[1] A. de la Torre, A. M. Peinado, J. C. Segura, J. L. Perez-Cordoba, M. C. Benitez and A. J. Rubio, “Histogram equalization of speech representation for robust speech recognition”, IEEE Trans. Speech and Audio Processing, vol. 13, no. 3, pp. 355–366, 2005.
[2] D. Erro, A. Moreno and A. Bonafonte, “Voice conversion based on weighted frequency warping”, IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 5, 2010.
[3] D. Erro and A. Moreno, “Frame alignment method for cross-lingual voice conversion”, in Interspeech, pp. 1969–1972, Antwerp, Belgium, 2007.
[4] D. O’Shaughnessy, Speech Communications: Human and Machine, 2nd ed., IEEE Press, 2000.
[5] D. Zeng and Y. Yu, “Voice conversion using structured Gaussian mixture model”, ICSP, pp. 541-544, 2010.
[6] E. E. Helander and J. Nurminen, “A novel method for prosody prediction in voice conversion”, ICASSP, vol. 4, pp. 509-512, 2007.
[7] E. Godoy, O. Rosec and T. Chonavel, “Alleviating the one-to-many mapping problem in voice conversion with context-dependent modeling”, in Interspeech, Brighton, U. K., 2009.
[8] E. Godoy, O. Rosec and T. Chonavel, “Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora”, IEEE Trans. Audio, Speech, and Language Processing, vol. 20, pp. 1313-1323, 2012.
[9] Histogram equalization, http://www.cs.utah.edu/~jfishbau/improc/project2/
[10] H. Valbret, E. Moulines and J. P. Tubach, “Voice transformation using PSOLA technique”, Speech Communication, vol. 11, no. 2–3, pp.175–187, 1992.
[11] H. Y. Gu and S. F. Tsai, “A discrete-cepstrum based spectrum-envelope estimation scheme and its example application of voice transformation”, International Journal of Computational Linguistics and Chinese Language Processing, vol. 14, no. 4, pp. 363-382, 2009.
[12] K. Pearson, “On lines and planes of closest fit to systems of points in space”, Philosophical Magazine, vol. 2, no. 6, pp. 559-572, 1901.
[13] K. Sjolander and J. Beskow, WaveSurfer, Centre for Speech Technology at KTH, http://www.speech.kth.se/wavesurfer/.
[14] K. Y. Park and H. S. Kim, “Narrowband to wideband conversion of speech using GMM based transformation,” in Proc. ICASSP, vol. 3, pp. 1843–1846, 2000.
[15] M. Abe, S. Nakamura, K. Shikano and H. Kuwabara, “Voice conversion through vector quantization”, in Proc. ICASSP, New York, pp. 565–568, 1988.
[16] M. Narendranath, H. A. Murthy, S. Rajendran and B. Yegnanarayana, “Transformation of formants for voice conversion using artificial neural networks”, Speech Communication, vol. 16, pp. 207-216, 1995.
[17] O. Cappe and E. Moulines, “Regularization techniques for discrete cepstrum estimation”, IEEE Signal Processing Letters, vol. 3, no. 4, pp. 100-102, 1996.
[18] OpenCV 1.0, http://sourceforge.net/projects/opencvlibrary/
[19] S. Young, “The HTK hidden Markov model toolkit: design and philosophy”, Tech Report TR.153, Department of Engineering, Cambridge University (UK), 1993.
[20] T. Dutoit, A. Holzapfel, M. Jottrand, A. Moinet, J. Perez and Y. Stylianou, “Toward a voice conversion system based on frame selection”, ICASSP, vol. 4, pp. 513-516, 2007.
[21] T. En-Najjary, O. Rosec and T. Chonavel, “A voice conversion method based on joint pitch and spectral envelope transformation”, in Interspeech, Jeju, Korea, pp. 1225-1228, 2004.
[22] T. Toda, Y. Ohtani and K. Shikano, “Eigenvoice conversion based on Gaussian mixture model”, in Interspeech, Pittsburgh, PA, USA , pp. 2446-2449, 2006.
[23] X. Xiao, J. Li, E. S. Chng and H. Li, “Maximum likelihood adaptation of histogram equalization with constraint for robust speech recognition”, in ICASSP, pp. 5480-5483, 2011.
[24] Y. Stylianou, O. Cappe and E. Moulines, “Continuous probabilistic transform for voice conversion”, IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[25] Y. Stylianou, “Modeling speech based on harmonic plus noise models”, in Nonlinear Speech Modeling and Applications, eds. G. Chollet et al., Springer-Verlag, Berlin, pp.244-260, 2005.
[26] Y. Stylianou, “Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification”, Ph.D. thesis, Ecole Nationale Superieure des Telecommunications, Paris, France, 1996.
[27] Z. Wu, E. S. Chng and H. Li, “Segment-based frame alignment for text-independent voice conversion”, Technical Report, 2012.
[28] Z. Z. Wu, T. Kinnunen, E. S. Chng and H. Li, “Text-independent F0 transformation with non-parallel data for voice conversion”, in Interspeech, pp. 1732-1735, Makuhari, Chiba, Japan, 2010.
[29] 王小川, Speech Signal Processing, revised 2nd ed., 全華圖書公司, 2009 (in Chinese).
[30] 古鴻炎, 張家維 and 王讚緯, “A voice conversion method that maps segmented frames with linear multivariate regression”, ROCLING, 2012 (in Chinese).
[31] 古鴻炎 and 張家維, “Improvements to the voice conversion method based on segment-wise LMR mapping”, ROCLING, 2013 (in Chinese).
[32] 吳昌益, A Study of Mandarin Speech Synthesis Using a Spectrum Evolution Model, Master’s thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007 (in Chinese).
[33] 張家維, A Voice Conversion Method Using Principal Component Vector Projection and Least-Mean-Square Mapping, Master’s thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2012 (in Chinese).
[34] 蔡松峯, Improvements of the GMM-Based Voice Conversion Method, Master’s thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009 (in Chinese).
[35] 簡延庭, HMM-Based Singing Voice Synthesis and Timbre Conversion, Master’s thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2013 (in Chinese).
