
Graduate Student: 柯宇澤 (Yu-Ze Ke)
Thesis Title: 基於遷移學習之卷積神經網路於中文語音情緒辨識研究 (Study of Transfer Learning-based Convolutional Neural Network for Chinese Speech Emotion Recognition)
Advisor: 蔡明忠 (Ming-Jong Tsai)
Oral Examination Committee: 李敏凡 (Min-Fan Lee), 張俊隆 (Chun-Lung Chang)
Degree: Master
Department: College of Engineering - Graduate Institute of Automation and Control
Publication Year: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: Chinese
Number of Pages: 66
Chinese Keywords: 語音情緒識別 (speech emotion recognition), 梅爾頻率倒譜係數 (Mel-frequency cepstral coefficients), 數據增強 (data augmentation), 遷移學習 (transfer learning)
English Keywords: Speech emotion recognition, Mel-frequency cepstral coefficients, Data augmentation, Transfer learning
    Speech emotion recognition aims to automatically identify and classify emotions from speech signals. As the range of applications for emotion recognition continues to expand, the importance of this field has become increasingly evident, and speech emotion recognition can be expected to play an even more critical role across many domains, providing people with smarter, more humane, and more personalized services and experiences. Most current research on speech emotion recognition takes English as its subject; to target emotion recognition in Chinese, this study constructed a Chinese speech emotion database (NtustACK). Because the amount of Chinese data was insufficient, three data augmentation methods were applied, namely noise filtering, volume adjustment, and pitch adjustment, expanding the Chinese dataset to four times its original size and increasing the diversity and scale of the data. To further improve the performance of the emotion recognition model, this study employed transfer learning: the most comprehensive English speech emotion databases currently available, IEMOCAP and RAVDESS, were used as training data to pre-train the model weights, after which the augmented Chinese database (NtustACK) was used for transfer learning, giving the model better performance on Chinese emotion recognition while compensating for the limited amount of data in NtustACK. This study also examined the accuracy of three individual features (Mel spectrogram, Mel-frequency cepstral coefficients, and chromagram) and of the three features stacked together, and the feature combination with the highest accuracy was selected as the input to the convolutional neural network model. The pre-trained model successfully classified six English emotions (happy, angry, sad, neutral, fear, and disgust) with an accuracy of 94.63%. Finally, after transfer learning, the model successfully classified five Chinese emotions (happy, angry, sad, neutral, and fear) with an accuracy of 92.67%.


    Speech emotion recognition aims to identify and classify emotions from speech signals automatically. As the application scope of speech emotion recognition continues to expand, the significance of this field becomes increasingly prominent. Most research on speech emotion recognition has focused on English. To target emotion recognition in Chinese, this study constructed a Chinese speech emotion database (NtustACK). Due to the scarcity of Chinese data, three data augmentation methods were employed in this research, including noise filtering, volume adjustment, and pitch modification, which expanded the Chinese dataset to four times its original size, enhancing data diversity and scale. This study further employed transfer learning techniques to improve the performance of the speech emotion recognition model. The most comprehensive English speech emotion databases, IEMOCAP and RAVDESS, were used as training data for pre-training the model's weights. Subsequently, the augmented Chinese database (NtustACK) was used for transfer learning, improving the model's performance in Chinese emotion recognition and compensating for the limited data in NtustACK. The study also investigated the accuracy of three different features (Mel spectrogram, Mel-frequency cepstral coefficients, and chromagram) and the accuracy after combining these three features. After comparison, the feature combination with the highest accuracy was selected as the input for the convolutional neural network model. The pre-trained model achieved an accuracy of 94.63% in classifying six English emotions (happy, angry, sad, neutral, fear, and disgust). In the final transfer learning results, the model successfully classified five Chinese emotions (happy, angry, sad, neutral, and fear) with an accuracy of 92.67%.
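    The three augmentation steps described above (noise filtering, volume adjustment, and pitch adjustment) are not published as code with this record. The following is a minimal Python sketch, assuming librosa/SciPy-style processing; the file path, filter cutoff, gain factor, and pitch step are illustrative assumptions, not the parameters actually used in the thesis.

        import librosa
        import numpy as np
        from scipy.signal import butter, filtfilt

        def augment(path, sr=16000):
            """Return the original utterance plus three augmented copies (4x the data)."""
            y, sr = librosa.load(path, sr=sr)

            # 1) Noise filtering: a simple high-pass filter to suppress low-frequency noise
            #    (the thesis does not specify its filter; a 100 Hz cutoff is assumed here).
            b, a = butter(4, 100, btype="highpass", fs=sr)
            y_filtered = filtfilt(b, a, y)

            # 2) Volume adjustment: scale the waveform amplitude (gain factor is illustrative).
            y_louder = np.clip(y * 1.5, -1.0, 1.0)

            # 3) Pitch adjustment: shift the pitch without changing duration (step is illustrative).
            y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

            return y, y_filtered, y_louder, y_pitched

        # Hypothetical usage on one NtustACK recording:
        # versions = augment("ntustack/happy_001.wav")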
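    A comparable sketch of the feature-extraction step: the Mel spectrogram, MFCCs, and chromagram are computed for each utterance and stacked into a single two-dimensional input for the convolutional neural network. How the thesis actually combines the three features, and the dimensions used here (128 Mel bands, 40 MFCCs, 12 chroma bins, a fixed frame count), are assumptions for illustration only.

        import librosa
        import numpy as np

        def stacked_features(y, sr=16000, n_frames=128):
            """Stack Mel spectrogram, MFCCs, and chromagram into one 2-D feature map."""
            mel = librosa.power_to_db(
                librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))    # (128, T)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)             # (40, T)
            chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=12)  # (12, T)

            # All three use the same hop length, so they share the frame count T and
            # can be concatenated along the feature (row) axis.
            feat = np.vstack([mel, mfcc, chroma])                          # (180, T)

            # Pad or truncate to a fixed length so every utterance has the same shape.
            if feat.shape[1] < n_frames:
                feat = np.pad(feat, ((0, 0), (0, n_frames - feat.shape[1])))
            return feat[:, :n_frames]

        # y, _ = librosa.load("ntustack/happy_001.wav", sr=16000)
        # x = stacked_features(y)   # shape (180, 128), fed to the CNN as one "image"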
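    Finally, a Keras-style sketch of the transfer-learning step: a CNN pre-trained on the English corpora (IEMOCAP/RAVDESS) is reused as a frozen feature extractor, and only a new classification head is trained on the augmented NtustACK data. The model file name, layer split, and optimizer settings are hypothetical; the thesis's actual fine-tuning strategy may differ.

        import tensorflow as tf

        # Load the CNN pre-trained on the English emotion corpora (hypothetical file name).
        base = tf.keras.models.load_model("pretrained_english_ser_cnn.h5")

        # Reuse everything up to the penultimate layer as a frozen feature extractor.
        feature_extractor = tf.keras.Model(base.input, base.layers[-2].output)
        feature_extractor.trainable = False

        # New classification head for the five Chinese emotions
        # (happy, angry, sad, neutral, fear).
        inputs = tf.keras.Input(shape=base.input_shape[1:])
        x = feature_extractor(inputs, training=False)
        outputs = tf.keras.layers.Dense(5, activation="softmax")(x)
        model = tf.keras.Model(inputs, outputs)

        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

        # Fine-tune on the augmented NtustACK features (x_train, y_train are placeholders).
        # model.fit(x_train, y_train, validation_split=0.2, epochs=30, batch_size=32)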

    Acknowledgments
    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1 Preface
      1.2 Research Motivation
      1.3 Research Methods
      1.4 Thesis Organization
    Chapter 2  Literature Review and Related Technologies
      2.1 Literature Review on Human Emotions
      2.2 Literature Review on Speech Recognition
      2.3 Speech Features
        2.3.1 Mel-Frequency Cepstral Coefficients (MFCCs)
        2.3.2 Mel-Spectrogram
        2.3.3 Chromagram
        2.3.4 Volume
        2.3.5 Pitch
        2.3.6 Timbre
        2.3.7 Formants
      2.4 Transfer Learning
      2.5 Convolutional Neural Networks
      2.6 Model Evaluation
    Chapter 3  Experimental Methods
      3.1 Speech Databases
        3.1.1 IEMOCAP Database
        3.1.2 RAVDESS Database
        3.1.3 Construction of the Chinese Speech Emotion Database (NtustACK)
      3.2 Speech Data Augmentation
      3.3 Speech Data Preprocessing
        3.3.1 Endpoint Detection
        3.3.2 Feature Extraction
      3.4 Building the Pre-trained Model
      3.5 Building the Fine-tuned Model
    Chapter 4  Experimental Results and Discussion
      4.1 Pre-trained Model Results
        4.1.1 Feature Accuracy Comparison
        4.1.2 Data Augmentation Accuracy Comparison
      4.2 Fine-tuned Model Results
        4.2.1 Transfer Learning Results (IEMOCAP to NtustACK)
        4.2.2 Transfer Learning Results (RAVDESS to NtustACK)
      4.3 Discussion of Transfer Learning Results
      4.4 Model Test Results on Additional Data
    Chapter 5  Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Prospects
    References


    Full Text Release Date: 2026/08/16 (off-campus network)
    Full Text Release Date: 2026/08/16 (National Central Library: Taiwan NDLTD system)