
Graduate Student: 邱馨怡 (Hsin-Yi Chiu)
Thesis Title: 中文醫療語音辨識:語音語料庫之建置與端對端之辨識技術
  (Recognition of Chinese Medical Speech: Construction of Speech Corpus and an End-to-End Solution by Joint CTC-Attention Model)
Advisor: 鍾聖倫 (Sheng-Luen Chung)
Committee Members: 鍾聖倫 (Sheng-Luen Chung), 蘇順豐 (Shun-Feng Su), 郭重顯 (Chung-Hsien Kuo), 方文賢 (Wen-Hsien Fang), 徐繼聖 (Gee-Sern Hsu)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Number of Pages: 87
Keywords (Chinese): 語音辨識、中文醫療語料庫、資料增量、深度學習
Keywords (English): Speech recognition, Chinese medical speech corpus, Data augmentation, Deep learning
  • Medical professionals devote a great deal of effort and time to paperwork for recording patient information. Medical speech recognition can assist them in entering medical records, documenting ward rounds, and tracking diagnoses. With the rapid development of deep learning, speech recognition has made enormous progress: end-to-end trainable architectures have replaced the traditional speech recognition pipeline, so that the acoustic, pronunciation, and language models no longer need to be trained separately, and state-of-the-art performance has been achieved. However, medical speech features specialized terminology and the particular speaking style of medical practitioners, and the results of recognizing medical speech with general automatic speech recognition (ASR) systems are unsatisfactory, which can be attributed to the lack of a medical speech corpus for training such systems. To promote the development of Chinese medical speech recognition, this thesis contributes the following: (1) the Chinese Medical Speech Corpus (ChiMeS) is proposed, a collection of 517 anonymized patient records read aloud by fifteen medical professionals at Taipei Hospital, totaling 855 minutes; (2) since what matters most in medical speech recognition is that important medical terms are not misrecognized, the keyword error rate (KER), which measures how often keywords are misrecognized, is proposed as an evaluation criterion that highlights the medical terminology in each utterance; (3) solutions based on Deep Speech 2 and the Joint CTC/Attention model are provided; with the Joint CTC/Attention model and all data augmentation methods applied, a WER of 15.05% and a KER of 7.54% are obtained, comparable to current English speech recognition solutions trained on a 270-hour corpus. A Chinese medical speech website has been set up to share the 14-hour annotated medical corpus, including the training and testing protocols, and to provide a trained Deep Speech 2 model as a baseline solution. In addition, a competition platform for the challenging ChiMeS-14 has been established for academic research, where researchers can submit their recognition results for fair evaluation.


    Medical professionals spend a significant portion of their energy and time keeping patient records for diagnosis and tracking purposes. Speech recognition has made tremendous progress with the rapid advancement of deep learning, which replaced the traditional recognition pipeline with an end-to-end trainable structure and delivers state-of-the-art performance. However, recognition of medical speech by general automatic speech recognition (ASR) systems is less than satisfactory, which can be attributed to the lack of a medical speech corpus on which to train ASR. To promote the development of Chinese medical speech recognition, this thesis contributes by: (1) proposing the Chinese Medical Speech Corpus (ChiMeS), a collection of read-outs by fifteen medical professionals of 517 anonymized patient records from Taipei Hospital, totaling 855 minutes; (2) presenting a more relevant performance criterion, the keyword error rate (KER), which highlights medical terminology in the speech utterances; and (3) providing solutions based on either Deep Speech 2 or the Joint CTC/Attention model. With all data augmentation methods applied, the Joint CTC/Attention model delivers a WER of 15.05% and a KER of 7.54%, superior to reported medical speech recognition solutions. A website has been set up to conditionally release the annotated 14-hour ChiMeS with its training/testing protocol, together with a baseline Deep Speech 2 model trained on ChiMeS-5 with data augmentation, as well as a competition platform for the challenging ChiMeS-14, where researchers can submit their recognition results for fair evaluation.
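
    To make the proposed KER criterion concrete, the sketch below shows one way a character-level WER and a keyword-level error rate could be computed. The exact KER definition, tokenization, and keyword dictionary used in the thesis are not reproduced here, so the occurrence-counting scheme and the sample keywords in the code are illustrative assumptions only.

```python
# Minimal sketch of WER / KER evaluation, assuming:
#   - character-level tokens (so the "WER" here is effectively a character error rate), and
#   - KER = fraction of reference keyword occurrences missing from the hypothesis.
# Both assumptions are illustrative; the thesis's exact definitions may differ.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(ref_tokens, hyp_tokens):
    """Error rate: edit distance normalized by the reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

def ker(ref_text, hyp_text, keywords):
    """Keyword error rate: share of reference keyword occurrences
    that are not recovered (at least as often) in the hypothesis."""
    total = errors = 0
    for kw in keywords:
        n_ref, n_hyp = ref_text.count(kw), hyp_text.count(kw)
        total += n_ref
        errors += max(n_ref - n_hyp, 0)
    return errors / max(total, 1)

# Hypothetical medical keyword list and transcripts for illustration.
keywords = ["高血壓", "糖尿病", "抗生素"]
ref = "病人有高血壓及糖尿病病史"
hyp = "病人有高血壓及糖料病史"   # "糖尿病" misrecognized as "糖料病"
print(f"WER = {wer(list(ref), list(hyp)):.3f}, KER = {ker(ref, hyp, keywords):.3f}")
```

    In this toy example the WER stays small, but the KER is large because one of the two reference keywords is lost, which is precisely the distinction a keyword-oriented criterion is meant to expose.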
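
    For context on the Joint CTC/Attention solution named above, the hybrid approach is commonly formulated as a multi-task objective that interpolates the CTC and attention losses, with decoding combining both scores. The formulation below is a sketch of the standard form from the joint CTC-attention literature; the interpolation weight λ actually used in the thesis is not stated here and should be treated as a tunable hyperparameter.

```latex
% Multi-task training objective (standard joint CTC-attention form):
\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda \, \mathcal{L}_{\mathrm{CTC}} \;+\; (1 - \lambda)\, \mathcal{L}_{\mathrm{Attention}},
\qquad 0 \le \lambda \le 1 .

% Joint decoding: the recognized sequence combines both decoder scores
% (a language-model term may also be added in practice):
\hat{Y} \;=\; \operatorname*{arg\,max}_{Y}
\Big\{ \lambda \log p_{\mathrm{CTC}}(Y \mid X) \;+\; (1 - \lambda) \log p_{\mathrm{Attention}}(Y \mid X) \Big\} .
```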

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
    Chapter 2: Literature Survey
      2.1 Automatic Speech Recognition
      2.2 Data Augmentation for Speech
      2.3 Keyword Detection
    Chapter 3: Chinese Medical Speech Corpus (ChiMeS)
      3.1 Medical Speech Corpus
        3.1.1 Corpus Collection and Annotation
        3.1.2 Dataset Partition and Training/Testing Protocols
      3.2 Baseline Solution by Deep Speech 2
      3.3 Experiments by Deep Speech 2
        3.3.1 System Setup and Training
        3.3.2 Ablation Test with Baseline Architecture
    Chapter 4: Joint CTC/Attention Model and Additional Data Augmentation
      4.1 Overview of Model Architecture
      4.2 Joint CTC/Attention Model
        4.2.1 CTC Decoder
        4.2.2 Attention Decoder
        4.2.3 Joint Decoding
      4.3 Data Augmentation
        4.3.1 Wave Augmentation
        4.3.2 Spectrogram Augmentation
        4.3.3 Keyword Augmentation
    Chapter 5: Experiment Setups and Results
      5.1 System Setup and Training
      5.2 Experiment Objectives
      5.3 Experiment Results
        5.3.1 Ablation Test with Different Architectures
        5.3.2 Ablation Test with Different Corpus Sizes
        5.3.3 Correct Identification of Keywords
        5.3.4 Cross-domain Test
    Chapter 6: Conclusion
    References
    Appendix A: Glossary
    Appendix B: Dictionary of ChiMeS-14
    Appendix C: Dictionary Used for Keyword Augmentation


    Full-text release date: 2025/08/20 (campus network)
    Full-text release date: 2025/08/20 (off-campus network)
    Full-text release date: 2025/08/20 (National Central Library: Taiwan NDLTD system)