Author: 柯宇澤 (Yu-Ze Ke)
Title: 基於遷移學習之卷積神經網路於中文語音情緒辨識研究 (Study of Transfer Learning-based Convolutional Neural Network for Chinese Speech Emotion Recognition)
Advisor: 蔡明忠 (Ming-Jong Tsai)
Committee members: 李敏凡 (Min-Fan Lee), 張俊隆 (Chun-Lung Chang)
Degree: Master
Department: College of Engineering, Graduate Institute of Automation and Control
Year of publication: 2023
Academic year: 111
Language: Chinese
Pages: 66
Keywords: Speech emotion recognition, Mel-frequency cepstral coefficients, Data augmentation, Transfer learning
Speech emotion recognition aims to automatically identify and classify emotions from speech signals. As its range of applications continues to expand, the importance of the field has become increasingly prominent, and it can be expected to play an ever more critical role in delivering smarter, more human-centered, and personalized services and experiences. Most speech emotion recognition research to date has focused on English. To target emotion recognition in Chinese, this study constructed a Chinese speech emotion database (NtustACK). Because the amount of Chinese data was insufficient, three data augmentation methods were applied: noise filtering, volume adjustment, and pitch modification. These expanded the Chinese dataset to four times its original size, increasing its diversity and scale. To further improve recognition performance, this study employed transfer learning: the most comprehensive English speech emotion databases, IEMOCAP and RAVDESS, were used to pre-train the model's weights, after which the augmented NtustACK database was used for transfer learning, improving the model's performance on Chinese emotion recognition while compensating for NtustACK's limited size. The study also compared the accuracy obtained with three individual features (Mel spectrogram, Mel-frequency cepstral coefficients, and chromagram) against that of their stacked combination, and the combination with the highest accuracy was selected as the input to the convolutional neural network model. The pre-trained model classified six English emotions (happy, angry, sad, neutral, fear, and disgust) with an accuracy of 94.63%. After transfer learning, the model classified five Chinese emotions (happy, angry, sad, neutral, and fear) with an accuracy of 92.67%.
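The three augmentation methods described in the abstract (noise filtering, volume adjustment, pitch modification) yielding a 4x dataset can be sketched in pure NumPy. This is only an illustrative approximation, not the thesis's actual pipeline: the spectral gate, the gain value, and the naive resampling pitch shift (which also changes duration; a real pipeline would use a phase vocoder such as librosa.effects.pitch_shift) are all assumptions.

```python
import numpy as np

def adjust_volume(signal, gain=1.5):
    """Volume adjustment: scale amplitude, clipping to [-1, 1]."""
    return np.clip(signal * gain, -1.0, 1.0)

def shift_pitch_naive(signal, semitones=2):
    """Crude pitch shift by resampling (note: also shortens the clip)."""
    rate = 2.0 ** (semitones / 12.0)
    idx = np.arange(0, len(signal), rate)
    return np.interp(idx, np.arange(len(signal)), signal)

def filter_noise(signal, threshold=0.01):
    """Simple spectral gate: zero FFT bins far below the peak magnitude."""
    spec = np.fft.rfft(signal)
    spec[np.abs(spec) < threshold * np.abs(spec).max()] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def augment(signal):
    """Original plus three variants: the 4x expansion from the abstract."""
    return [signal,
            filter_noise(signal),
            adjust_volume(signal),
            shift_pitch_naive(signal)]

# Demo on a synthetic 1-second, 16 kHz tone standing in for an utterance.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clip = 0.5 * np.sin(2 * np.pi * 440 * t)
variants = augment(clip)
print(len(variants))  # 4
```

Applying all three transforms to every recording turns each original clip into four training examples, which is how the 4x expansion is obtained.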
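The feature stacking can also be sketched. The abstract does not specify how the three feature maps are combined, so the concatenation along the frequency axis below is one plausible reading, and the random arrays are stand-ins for real features (in practice computed with, e.g., librosa.feature.melspectrogram, mfcc, and chroma_stft).

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100  # time frames from a short utterance

# Stand-in feature maps with typical per-feature dimensionalities.
mel_spec = rng.random((128, n_frames))   # Mel spectrogram
mfcc     = rng.random((40,  n_frames))   # Mel-frequency cepstral coefficients
chroma   = rng.random((12,  n_frames))   # chromagram

def normalize(feat):
    """Per-feature min-max scaling so no feature dominates the stack."""
    return (feat - feat.min()) / (feat.max() - feat.min() + 1e-8)

# Concatenate along the frequency axis and add a channel dimension,
# giving a single-channel "image" the CNN can consume.
stacked = np.concatenate([normalize(f) for f in (mel_spec, mfcc, chroma)], axis=0)
cnn_input = stacked[np.newaxis, :, :]   # (channels, freq bins, time frames)
print(cnn_input.shape)  # (1, 180, 100)
```

Normalizing each feature before stacking matters because the raw dynamic ranges of a Mel spectrogram and of MFCCs differ by orders of magnitude.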
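The transfer learning step (pre-train on English corpora, then adapt to the Chinese NtustACK data) can be illustrated in miniature without a deep learning framework. Everything here is a stand-in: a fixed random matrix plays the role of the frozen pre-trained convolutional layers, synthetic data plays the role of the augmented Chinese set, and only the new 5-class head is trained, mirroring the usual freeze-and-fine-tune recipe rather than the thesis's actual network.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Pre-trained" feature extractor: stands in for the layers whose weights
# were learned on IEMOCAP/RAVDESS. Frozen during transfer.
W_frozen = rng.normal(size=(180, 32)) / np.sqrt(180)

def extract(x):
    return np.maximum(x @ W_frozen, 0.0)  # frozen layers + ReLU

# Synthetic stand-in for the augmented Chinese data: 5 emotion classes.
n, n_classes = 200, 5
X = rng.normal(size=(n, 180))
y = rng.integers(0, n_classes, size=n)

# New classification head: the only part updated during transfer learning.
W_head = np.zeros((32, n_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

feats = extract(X)
onehot = np.eye(n_classes)[y]

def loss():
    p = softmax(feats @ W_head)
    return -np.mean(np.log(p[np.arange(n), y] + 1e-12))

initial = loss()
for _ in range(200):  # gradient descent on the head only
    grad = feats.T @ (softmax(feats @ W_head) - onehot) / n
    W_head -= 0.1 * grad
final = loss()
print(f"head-only training loss: {initial:.3f} -> {final:.3f}")
```

Only `W_head` changes during the loop, so the knowledge encoded in the frozen extractor is reused on the new task; this is the mechanism that lets the small Chinese dataset benefit from the much larger English corpora.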