Graduate student: 范晉桓 Jin-Huan Fan
Thesis title: Online Medical Speech Recognition with Punctuation by Transformer and Conformer Deep Learning Networks (以Transformer與Conformer為基礎能自動標註標點符號的醫療語音辨識技術)
Advisor: 鍾聖倫 Sheng-Luen Chung
Committee members: 鍾聖倫 Sheng-Luen Chung, 方文賢 Wen-Hsien Fang, 陳柏琳 Berlin Chen, 廖元甫 Yuan-Fu Liao, 丁賢偉 Hsien-Wei Ting
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Publication year: 2021
Graduation academic year: 109
Language: Chinese
Pages: 87
Chinese keywords: 語音辨識 (speech recognition), 中文醫療資料庫 (Chinese medical corpus), 深度學習 (deep learning), MVC網頁開發 (MVC web development)
English keywords: deep learning, speech recognition, Chinese medical speech corpus, MVC
Access counts: viewed 206, downloaded 0
Automatic Speech Recognition (ASR) technology applied in medical and nursing settings helps physicians and nursing staff record patient diagnoses, case histories, postoperative care, and ward-round notes more efficiently. However, conventional speech recognition outputs no punctuation, which reduces the readability of the transcript, and the higher recognition error rate caused by abundant medical terminology also hampers its efficacy in practical integration. Accordingly, the contributions of this thesis are: (1) A punctuated Chinese medical speech corpus, psChiMeS-14: a collection of 516 medical records, totaling 867 minutes, recorded by 15 professional nursing staff from Taipei Hospital of the Ministry of Health and Welfare. The corpus is manually labeled with punctuation marks (colons, commas, and periods) according to a defined punctuation standard, ready for general end-to-end ASR models. (2) Two self-attention-based speech recognition solutions: one built on the Transformer and the other on the convolution-augmented Conformer. Trained and tested on the psChiMeS-14 corpus, the two solutions deliver a CER (character error rate) of 13.1% and 10.5%, and a KER (keyword error rate) of 17.22% and 13.10%, respectively, compared with a CER of 15.70% and a KER of 22.50% for the attention-based Joint CTC/Attention architecture. Having been trained on professional medical speech, the new solutions are also applicable to the speech recognition of general Chinese hospital medical records. (3) An online medical speech recognition system, developed in a Python environment using the MVC framework: through back-end API interfacing, the front-end web user interface is connected to the back-end speech recognition model. The integrated system supports real-time recording and recognition, recognition of uploaded medical-record audio files, and retrieval of uploaded records, helping medical personnel speed up their paperwork.
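The CER figures above are edit-distance ratios. As a minimal sketch (the thesis's exact normalization and keyword list for KER are not reproduced here), character error rate can be computed as the Levenshtein distance between the reference and hypothesis transcripts, divided by the reference length:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences, via
    dynamic programming over a single rolling row."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion from ref
                        dp[j - 1] + 1,   # insertion into ref
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit operations normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For Chinese transcripts, each character (including punctuation marks, when they are part of the target) counts as one token, e.g. `cer("今天天氣", "今天氣")` yields 0.25.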
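The online system connects the web front end to the recognition model through a back-end API (the thesis builds this with Flask under an MVC layout). The request/response contract can be sketched with the standard library alone; `recognize` below is a hypothetical stand-in for the back-end Conformer model's inference, not the thesis's actual code:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def recognize(audio_bytes):
    # Hypothetical stand-in: the real back end would run the trained
    # Conformer model on the uploaded audio and return the transcript.
    return "stub transcript"

class ASRHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the uploaded audio from the request body.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        # Run recognition and return the transcript as JSON.
        body = json.dumps({"transcript": recognize(audio)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; default logs every request to stderr
```

In this shape, both real-time recording and file upload reduce to the same POST of audio bytes; the front end only ever consumes the JSON transcript.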