
Author: 范晉桓 (Jin-Huan Fan)
Title: 以Transformer與Conformer為基礎能自動標註標點符號的醫療語音辨識技術
(Online Medical Speech Recognition with Punctuation by Transformer and Conformer Deep Learning Networks)
Advisor: 鍾聖倫 (Sheng-Luen Chung)
Committee: 鍾聖倫 (Sheng-Luen Chung), 方文賢 (Wen-Hsien Fang), 陳柏琳 (Berlin Chen), 廖元甫 (Yuan-Fu Liao), 丁賢偉 (Hsien-Wei Ting)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar; AY 2020-2021)
Language: Chinese
Pages: 87
Chinese keywords: speech recognition, Chinese medical speech corpus, deep learning, MVC, web development
English keywords: deep learning, speech recognition, Chinese medical speech corpus, MVC
Abstract:

Automatic Speech Recognition (ASR) technology applied in medical and nursing settings helps physicians and nurses quickly produce patient diagnosis records, case records, and postoperative-care and ward-round notes. However, conventional ASR output carries no punctuation, which reduces the readability of the transcript, and the elevated recognition error rate caused by abundant medical terminology further undermines its efficacy in practical integration. Accordingly, the contributions of this thesis are: (1) A punctuated Chinese medical speech corpus, psChiMeS-14: 516 medical records totaling 867 minutes, recorded by 15 professional nurses at Taipei Hospital of the Ministry of Health and Welfare and manually annotated with punctuation marks (colons, commas, and periods) according to a defined annotation standard, ready for general end-to-end ASR models. (2) Two self-attention-based speech recognition solutions, one built on the Transformer and the other on the Conformer, which adds a convolution module to self-attention. Trained and tested on psChiMeS-14, they achieve CERs (character error rates) of 13.1% and 10.5% and KERs (keyword error rates) of 17.22% and 13.10%, respectively, against a CER of 15.70% and a KER of 22.50% for the attention-based Joint CTC/Attention baseline; having been trained on professional medical speech, both solutions are also applicable to speech recognition of general in-patient medical records.
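For concreteness, the CER reported above is the Levenshtein (edit) distance between the recognized and reference character sequences, normalized by reference length, and KER applies the same measure to a list of medical keywords. The following minimal Python sketch is illustrative only, not the thesis's evaluation code, and the example strings are made up:

    # Minimal CER sketch: Levenshtein distance over characters, normalized by
    # the reference length. KER in the thesis restricts the same measure to a
    # keyword list; only CER is sketched here.
    def edit_distance(ref: str, hyp: str) -> int:
        """Levenshtein distance between two character sequences."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def cer(ref: str, hyp: str) -> float:
        """Character error rate = edit distance / reference length."""
        return edit_distance(ref, hyp) / max(len(ref), 1)

    # Hypothetical example: one substituted character in a nine-character line.
    assert abs(cer("病患意識清楚無發燒", "病患意試清楚無發燒") - 1 / 9) < 1e-9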


(3) An online medical speech recognition system, developed in a Python environment under an MVC architecture: a back-end API connects the web front-end interface to the speech recognition model. The integrated system supports real-time recording and recognition, recognition of uploaded medical-record audio files, and storage and retrieval of uploaded records, helping medical staff speed up the paperwork most commonly encountered in a hospital.
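To illustrate the API interfacing described in contribution (3), below is a minimal sketch of a controller route, assuming a Flask-based MVC setup consistent with the thesis's Python web stack; the /recognize route name and the transcribe() stub are hypothetical stand-ins, not the system's actual code:

    # Sketch of a back-end API joining the web front end to the ASR model.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def transcribe(audio_bytes: bytes) -> str:
        # Placeholder for model inference: decode the waveform with the
        # trained ASR model and return a punctuated transcript.
        raise NotImplementedError("plug the trained model in here")

    @app.route("/recognize", methods=["POST"])
    def recognize():
        # The front-end page posts recorded or uploaded audio as form data.
        audio = request.files["audio"].read()
        return jsonify({"transcript": transcribe(audio)})

    if __name__ == "__main__":
        app.run(port=5000)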

Table of Contents:

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
    1.1 Motivation
    1.2 Difficulties and Challenges
    1.3 Contributions of This Study
Chapter 2: Related Work
    2.1 Automatic Speech Recognition (ASR)
    2.2 Automatic Punctuation Restoration
    2.3 Python Web Development
Chapter 3: Punctuated Chinese Medical Speech Corpus
    3.1 Composition of Medical Records and Annotation Segmentation
        3.1.1 Composition of medical records
        3.1.2 Speech annotation and segmentation
    3.2 Punctuation annotation rules
    3.3 The punctuated Chinese medical speech corpus (psChiMeS-14)
Chapter 4: Methods
    4.1 Training ASR on the same audio with punctuated labels
    4.2 Review of the Joint CTC-Attention model
    4.3 Transformer speech recognition model
        4.3.1 Positional Encoding
        4.3.2 Encoder: Multi-Head Attention
        4.3.3 Feed-Forward Networks
        4.3.4 Decoder: Masked Multi-Head Attention
        4.3.5 Self-attention decoder loss function
        4.3.6 CTC Decoder
        4.3.7 Joint Decoding
    4.4 Conformer speech recognition model
        4.4.1 Improved Feed-Forward Networks
        4.4.2 Encoder: Convolution Module
        4.4.3 Joint Decoding
Chapter 5: Experimental Results
    5.1 Evaluation Metrics
        5.1.1 Character Error Rate (CER)
        5.1.2 Sentence Error Rate (SER)
        5.1.3 Keyword Error Rate (KER)
    5.2 Results
        5.2.1 Comparison of architectures, parameters, and performance on sChiMeS-14
        5.2.2 Performance of self-attention-based models on psChiMeS-14
        5.2.3 Results and discussion for self-attention models on sChiMeS-14 and psChiMeS-14
        5.2.4 Comparison of Joint CTC-Attention and Conformer ASR on psChiMeS-14
Chapter 6: Online Medical Speech Recognition System
    6.1 System Architecture and Environment
    6.2 Features and Operation Flow
    6.3 User Interface
Chapter 7: Conclusions and Future Work
    7.1 Conclusions
    7.2 Future Research Directions
References
Appendices
    A. Chinese-English glossary
    B. Common medical keywords for keyword augmentation
    C. Chinese and English word categories in the medical corpus
Committee Comments and Responses


Full-text release dates: 2024/08/12 (campus network); 2026/08/12 (off-campus network); 2026/08/12 (National Central Library: Taiwan NDLTD system).