
Graduate Student: Shih-Cheng Chang (張仕承)
Thesis Title: Emotional Voice Conversion Using Prosodic and Spectral Features (使用韻律與頻譜特徵之情緒語音轉換)
Advisor: Hung-Yan Gu (古鴻炎)
Oral Examination Committee: Hsin-Min Wang (王新民), Ming-Shing Yu (余明興), Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: Chinese
Number of Pages: 88
Keywords (Chinese): emotional speech, voice conversion, segmental prosodic features, spectral Gaussian mixture model, F0 Gaussian mixture model, dynamic duration adjustment
Keywords (English): emotional speech, voice conversion, segmental prosodic features, spectral GMM, F0 GMM, dynamic speech duration adjustment

This thesis studies conversion methods for three prosodic features (pitch contour, duration, and intensity) and uses them to build an emotional voice conversion system. Here, emotional voice conversion means converting input neutral speech into speech carrying an angry, happy, or sad emotion. In the training stage, a parallel corpus of 120 sentences is used to train an F0 GMM and a spectrum GMM for each of the three target emotions; then, according to sentence segmentation rules, the cross-sentence means and standard deviations of the prosodic parameters of the three kinds of emotional speech are computed for each segment. In the conversion stage, the trained F0 and spectrum GMMs map the pitch contour and DCC spectral coefficients of the neutral speech to those of the target emotional speech. Because F0-GMM conversion introduces pitch fluctuation, we study median smoothing and moving-average processing to alleviate it. Next, using the segmental statistics tables of the three prosodic parameters, pitch, intensity, and duration are converted with a segmental standard-deviation matching method. To further improve the converted emotional speech, we propose a dynamic duration adjustment method that stretches or compresses duration frame by frame according to each frame's energy ratio. With the converted emotional speech, two subjective listening tests were conducted. The first compares the emotion of speech converted by different methods; our method obtained vote rates of 95% for the angry emotion, 65% for the happy emotion, and 67.5% for the sad emotion. The second is an emotion identification test; our method obtained recognition rates of 87.5% for the angry emotion, 61.3% for the happy emotion, and 77.5% for the sad emotion. Therefore, our method achieves good emotional voice conversion performance.
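For readers unfamiliar with the GMM mapping referred to above, the following is a minimal sketch of joint-GMM training, conditional-expectation conversion, and median plus moving-average smoothing of a converted pitch contour. The library choices (NumPy, SciPy, scikit-learn), function names, and hyper-parameters are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (not the thesis's code): joint-GMM training, the classic
# conditional-expectation mapping used for F0/DCC conversion, and
# median + moving-average smoothing of the converted pitch contour.
import numpy as np
from scipy.signal import medfilt
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def train_joint_gmm(src, tgt, n_mix=8):
    """Fit one GMM on joint [source; target] vectors built from DTW-aligned frames."""
    joint = np.hstack([src, tgt])                        # (frames, 2 * dim)
    gmm = GaussianMixture(n_components=n_mix, covariance_type='full',
                          reg_covar=1e-4, max_iter=200)
    gmm.fit(joint)
    return gmm


def gmm_convert(gmm, x):
    """Map source frames x to target frames via E[y | x] under the joint GMM."""
    dim = x.shape[1]
    # Mixture posteriors P(m | x) from the marginal source-side GMM.
    log_p = np.stack([np.log(gmm.weights_[m]) +
                      multivariate_normal.logpdf(x, gmm.means_[m, :dim],
                                                 gmm.covariances_[m, :dim, :dim])
                      for m in range(gmm.n_components)], axis=1)
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    y = np.zeros_like(x, dtype=float)
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m, :dim], gmm.means_[m, dim:]
        S_xx = gmm.covariances_[m, :dim, :dim]
        S_yx = gmm.covariances_[m, dim:, :dim]
        # E[y | x, m] = mu_y + S_yx S_xx^{-1} (x - mu_x)
        y += resp[:, m:m + 1] * (mu_y + (x - mu_x) @ np.linalg.solve(S_xx, S_yx.T))
    return y


def smooth_pitch(f0, median_len=5, ma_len=5):
    """Reduce frame-to-frame fluctuation in a converted F0 contour:
    median smoothing followed by a moving average."""
    f0_med = medfilt(f0, kernel_size=median_len)         # kernel size must be odd
    window = np.ones(ma_len) / ma_len
    return np.convolve(f0_med, window, mode='same')
```

In the thesis, this kind of mapping is trained separately for each target emotion, once for the F0 contour and once for the DCC spectral coefficients.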


In this thesis, conversion methods for three prosodic features (pitch contour, duration, and intensity) are studied, and an emotional voice conversion system is then constructed that converts a neutral input speech into speech with an angry, happy, or sad emotion. In the training stage, an F0 GMM and a spectrum GMM are trained for each of the three target emotions using the corresponding parallel corpus of 120 sentences. Based on sentence segmentation rules, the mean and standard deviation of each prosodic feature are measured across sentences for each of the three segments, and this measurement is performed separately for each target emotion's training sentences. In the conversion stage, the pitch contour and DCC coefficients of a neutral input speech are mapped to those of the specified target emotion using the corresponding F0 and spectrum GMMs. Because the pitch contour obtained from F0-GMM conversion exhibits fluctuations, we reduce them with median smoothing and moving-average processing. Next, using the segmental tables of statistical parameters obtained in the training stage, the three prosodic features (pitch contour, duration, and intensity) are converted with the segmental standard-deviation matching (SSDM) method. To bring the emotion expressed in the converted speech closer to the target emotion, we also propose a dynamic speech-duration adjustment method in which the duration of each frame is determined according to its energy ratio.
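As a rough illustration of the two prosody steps just described, the sketch below applies per-segment mean and standard-deviation matching to a prosodic parameter sequence and distributes a duration change over frames in proportion to their energy. The matching and scaling formulas, segment boundaries, and example numbers are assumptions inferred from the description above, not the equations given in the thesis.

```python
# Illustrative sketch only: one plausible reading of segmental standard
# deviation matching and energy-proportional duration adjustment.
import numpy as np


def segmental_std_matching(x, segments, neutral_stats, target_stats):
    """For each segment (start, end), shift/scale the neutral-speech parameter
    sequence x so its mean and standard deviation match the target emotion."""
    y = x.astype(float).copy()
    for k, (s, e) in enumerate(segments):
        mu_n, sd_n = neutral_stats[k]            # cross-sentence stats, neutral speech
        mu_t, sd_t = target_stats[k]             # cross-sentence stats, target emotion
        y[s:e] = (x[s:e] - mu_n) / max(sd_n, 1e-8) * sd_t + mu_t
    return y


def dynamic_duration_factors(frame_energy, overall_factor):
    """Spread an overall duration-scaling factor over frames in proportion to
    each frame's energy ratio, so high-energy frames are stretched or
    compressed more; the average factor stays equal to overall_factor."""
    frame_energy = np.asarray(frame_energy, dtype=float)
    ratio = frame_energy / frame_energy.sum()    # per-frame energy ratio
    return 1.0 + (overall_factor - 1.0) * ratio * len(frame_energy)


# Example: converting frame-level intensity of a 3-segment sentence and
# stretching total duration by 20% (all numbers are made up).
if __name__ == "__main__":
    intensity = np.random.default_rng(0).normal(60.0, 5.0, size=300)
    segments = [(0, 100), (100, 200), (200, 300)]
    neutral_stats = [(60.0, 5.0)] * 3
    angry_stats = [(70.0, 9.0), (72.0, 10.0), (68.0, 8.0)]
    converted = segmental_std_matching(intensity, segments, neutral_stats, angry_stats)
    factors = dynamic_duration_factors(np.abs(converted), overall_factor=1.2)
    print(converted[:5], factors.mean())
```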
To evaluate the performance of our emotional voice conversion system, we conducted two subjective listening tests. The first test compares the emotional expression of speech converted by two different conversion methods. The percentages of votes obtained by our method are 95% for the angry emotion, 65% for the happy emotion, and 67.5% for the sad emotion. In the second test, each participant is asked to identify the emotion expressed in the speech played to them. The results show that the recognition rates obtained with our conversion method are 87.5% for the angry emotion, 61.3% for the happy emotion, and 77.5% for the sad emotion. Therefore, the proposed system is effective in converting neutral speech into speech of a specified target emotion.

Table of Contents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures and Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Literature Review
    1.2.1 Spectral Feature Coefficients
    1.2.2 Conversion Methods for Prosodic Parameters
    1.2.3 Methods for Emotional Voice Conversion
  1.3 Research Method
    1.3.1 Training Stage of the Emotional Voice Conversion System
    1.3.2 Conversion Stage of the Emotional Voice Conversion System
  1.4 Thesis Organization
Chapter 2 Corpus Preparation and Emotion Parameter Estimation
  2.1 Corpus Recording
  2.2 Phonetic Labeling
  2.3 Discrete Cepstrum Coefficient Estimation
  2.4 Estimation of Frame-Level Pitch, Intensity, and Duration Parameters
  2.5 DTW Frame Alignment
Chapter 3 Training of the Emotional Voice Conversion Models
  3.1 Gaussian Mixture Model (GMM)
  3.2 GMM Training Method
  3.3 Training of the F0 GMM
  3.4 Training of the Spectrum GMM
  3.5 Segmental Statistics Tables of Prosodic Parameters
Chapter 4 Emotional Voice Conversion Methods
  4.1 Spectral Coefficient Conversion Method
  4.2 Prosodic Parameter Conversion Methods
    4.2.1 Pitch Conversion Method
    4.2.2 Intensity Conversion Method
    4.2.3 Duration Conversion Method
Chapter 5 System Integration and Experiments
  5.1 Program Implementation and System Interface
  5.2 HNM Signal Synthesis
  5.3 Pitch Conversion Experiments
  5.4 Intensity Conversion Experiments
  5.5 Duration Conversion Experiments
  5.6 Spectral Distance Measurement Experiments
  5.7 Listening Tests
    5.7.1 Listening Test 1: Emotional Speech Comparison
    5.7.2 Listening Test 2: Emotional Speech Identification
Chapter 6 Conclusion
References

