Graduate student: | 張仕承 (Shih-Cheng Chang)
---|---
Thesis title: | 使用韻律與頻譜特徵之情緒語音轉換 (Emotional Voice Conversion Using Prosodic and Spectral Features)
Advisor: | 古鴻炎 (Hung-Yan Gu)
Oral defense committee: | 王新民 (Hsin-Min Wang), 余明興 (Ming-Shing Yu), 林伯慎 (Bor-Shen Lin)
Degree: | Master
Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication year: | 2017
Academic year: | 105
Language: | Chinese
Pages: | 88
Keywords: | emotional speech, voice conversion, segmental prosodic features, spectral GMM, F0 GMM, dynamic speech duration adjusting
本論文研究了三種韻律特徵(音高軌跡、音長、音量)的轉換方法,然後據以建造一個情緒語音轉換系統,在此情緒語音轉換指的是把輸入的中性語音轉換成具有生氣、開心或悲傷情緒的語音。在訓練階段,使用120句的平行語料,為三種目標情緒分別訓練出音高GMM與頻譜GMM模型,接著根據語句之分段規則,去計算三種情緒語音在各分段的跨語句之韻律參數的平均值與標準差。在轉換階段,使用訓練好的音高與頻譜GMM模型,分別將中性語音之音高軌跡與DCC頻譜係數對映成目標情緒語音之音高軌跡與DCC頻譜係數,由於音高GMM轉換會發生音高抖動的情況,因此我們研究以中值平滑處理及滑動平均處理來作改進;接著使用三種韻律參數的分段統計表,以分段式標準差匹配法去作音高、音量與音長的轉換,轉換後為了改善情緒語音轉換的效果,我們提出一種音長之動態調整方法,就是依各音框的能量比例值去動態作音框單位的音長伸縮。使用轉換出的情緒語音,我們進行了二項主觀聽測的實驗,第一項是不同轉換方法所轉出語音的情緒比較實驗,我們方法獲得的得票率分別為,生氣情緒95%、開心情緒65%、悲傷情緒67.5%;第二項是情緒辨別之實驗,我們方法得到的辨別率分別為,生氣情緒87.5%、開心情緒61.3%、悲傷情緒77.5%。所以,我們方法達成了不錯的情緒語音轉換效果。
In this thesis, conversion methods for three prosodic features (pitch contour, duration, and intensity) are studied, and an emotional voice conversion system is then constructed. The system converts a neutral input speech into speech carrying an angry, happy, or sad emotion. In the training stage, an F0 GMM and a spectrum GMM are trained for each of the three target emotions using the corresponding parallel corpus of 120 sentences. Then, according to sentence segmentation rules, the cross-sentence mean and standard deviation of each prosodic feature are measured per segment, separately for each target emotion's training sentences. In the conversion stage, the pitch contour and DCC (discrete cepstrum coefficients) of a neutral input speech are mapped to those of the specified target emotion by the corresponding F0 and spectrum GMMs. Because the pitch contour converted by the F0 GMM exhibits fluctuations, we reduce them with median smoothing and moving-average processing. Next, using the segmental statistics tables obtained in the training stage, the three prosodic features are converted by segmental standard-deviation matching (SSDM). Finally, to bring the emotion expressed in the converted speech closer to the target emotion, we propose a dynamic speech duration adjusting method, in which the duration of each frame is stretched or shrunk according to its energy ratio.
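The conversion steps described above can be illustrated with a minimal Python sketch. This is not the thesis's actual implementation: the window sizes, function names, and in particular the exact form of the energy-weighted duration stretch are assumptions made for illustration.

```python
import numpy as np

def smooth_f0(f0, med_win=5, avg_win=5):
    """Reduce jitter in a GMM-converted F0 contour: median smoothing
    followed by a moving average (window sizes are assumed, odd-valued)."""
    f0 = np.asarray(f0, dtype=float)
    half = med_win // 2
    padded = np.pad(f0, half, mode="edge")
    med = np.array([np.median(padded[i:i + med_win]) for i in range(len(f0))])
    kernel = np.ones(avg_win) / avg_win
    pad2 = np.pad(med, avg_win // 2, mode="edge")
    return np.convolve(pad2, kernel, mode="valid")[:len(f0)]

def std_match(x, mu_src, sd_src, mu_tgt, sd_tgt):
    """Standard-deviation matching for one segment: shift and rescale a
    source (neutral) prosodic value onto the target emotion's segmental
    mean and standard deviation."""
    return mu_tgt + (x - mu_src) * (sd_tgt / sd_src)

def frame_repeats(energies, global_ratio):
    """Hypothetical sketch of energy-based dynamic duration adjustment:
    each frame's stretch factor is the global duration ratio weighted by
    the frame's energy relative to the mean, rounded to a repeat count."""
    e = np.asarray(energies, dtype=float)
    weights = e / e.mean()
    return np.maximum(1, np.rint(global_ratio * weights)).astype(int)
```

For example, `std_match(110, 100, 10, 200, 20)` maps a neutral F0 value one source standard deviation above the source mean to one target standard deviation above the target mean, i.e. 220 Hz.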
To evaluate the performance of our emotional voice conversion system, we conducted two subjective listening tests. In the first test, participants compared the emotional expression of speech converted by two different methods; our method obtained 95% of the votes for the angry emotion, 65% for happy, and 67.5% for sad. In the second test, each participant was asked to identify the emotion expressed in the speech played to them; the recognition rates for speech converted by our method are 87.5% for angry, 61.3% for happy, and 77.5% for sad. These results show that the proposed methods are effective in converting a neutral speech into speech of a specified target emotion.