| Field | Value |
| --- | --- |
| Graduate Student | 李憶萱 Yi-Shiuan Li |
| Thesis Title | Chinese Medical Speech Recognition: Speech Corpus and Automatic Speech Recognition Technique (中文醫學語音辨識: 語音資料庫與自動辨識技術) |
| Advisor | 鍾聖倫 Sheng-Luen Chung |
| Committee Members | 鍾聖倫 Sheng-Luen Chung, 方文賢 Wen-Hsien Fang, 陳柏琳 Berlin Chen, 廖元甫 Yuan-Fu Liao, 丁賢偉 Hsien-Wei Ting |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2021 |
| Academic Year | 109 |
| Language | Chinese |
| Pages | 96 |
| Keywords | Deep learning, Chinese medical speech corpus, Speech recognition |
Concerning Chinese medical speech recognition technology, this thesis adopts a data-centric perspective and follows the Machine Learning Operations (MLOps) workflow of development and deployment. First is dataset cleansing: previously mislabeled transcripts in ChiMeS-14 are corrected, and the utterances are re-segmented by semantic completeness into sChiMeS-14. Second is optimization of the speech recognition model: with the Joint CTC/Attention ASR architecture fixed, and facing the challenge of an extremely limited corpus, we exhaustively examine techniques from the literature, including waveform- and spectrogram-level data augmentation and the incorporation of language models, for their effect on recognition accuracy. Last is the mitigation of concept drift: when the medical departments encountered after deployment differ from those sampled in the original training set, differences in specialized terminology give rise to the Out-of-Keyword (OOK) problem. This thesis proposes an end-to-end keyword augmentation method that alleviates the OOK problem without re-recording complete medical-record speech for the new departments. Overall, to promote the development of Chinese medical speech recognition, this thesis makes three concrete contributions: (1) the sChiMeS corpus, comprising 14.4 hours of speech in 7,225 utterances; (2) a trained Joint CTC/Attention ASR model, achieving a Character Error Rate (CER) of 12.85% and a Keyword Error Rate (KER) of 17.62% on the sChiMeS-14 test set; and (3) a test platform for evaluating the performance of other ASR models. Details are available at the ChiMeS portal (https://iclab.ee.ntust.edu.tw/home).
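The spectrogram-level augmentation examined above can be illustrated with a SpecAugment-style masking pass. This is a minimal sketch operating on a plain list-of-lists (time x frequency) spectrogram; the function name and mask sizes are illustrative assumptions, not the configuration used in the thesis:

```python
import random

def spec_augment(spec, max_f=4, max_t=10, seed=None):
    """Zero out one random frequency band and one random time span
    in a (time x freq) spectrogram, SpecAugment-style."""
    rng = random.Random(seed)
    n_t, n_f = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched
    # frequency mask: f consecutive bins starting at f0
    f = rng.randint(0, max_f)
    f0 = rng.randint(0, n_f - f)
    for row in out:
        for j in range(f0, f0 + f):
            row[j] = 0.0
    # time mask: t consecutive frames starting at t0
    t = rng.randint(0, min(max_t, n_t))
    t0 = rng.randint(0, n_t - t)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * n_f
    return out
```

In practice both masks are applied per training example on log-mel features, and waveform-level perturbations (e.g. speed changes via SoX) are combined with them.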
Concerning Chinese medical speech recognition technology, this study readdresses earlier encountered issues following the process of Machine Learning Engineering for Production (MLOps), from a data-centric perspective. First is data cleansing of the speech corpus ChiMeS (Chinese Medical Speech): with mislabels corrected and utterances re-segmented into complete sentences, the newly released sChiMeS-14 contains readouts of 516 in-patient records by 15 professional nurses. Second is optimization of the speech recognition model: with the Joint CTC/Attention model as the ASR baseline, auxiliary measures of data augmentation and language models are examined to verify their efficacy in boosting recognition performance on a very limited speech corpus. Third and last is the concept drift problem: when the trained model is deployed in a setting different from the one where the training data was collected, the Out-of-Keyword (OOK) problem may emerge. An end-to-end keyword augmentation method is proposed to alleviate this problem without resorting to a significant amount of new recording. Overall, to facilitate the development of Chinese medical speech recognition, this thesis contributes: (1) the sChiMeS corpus, the first Chinese medical speech corpus of its kind, totaling 14.4 hours and 7,225 sentences; (2) a trained Joint CTC/Attention ASR model, yielding a Character Error Rate (CER) of 12.85% and a Keyword Error Rate (KER) of 17.62%, respectively, on the sChiMeS-14 test set; and (3) an evaluation platform to solicit competition and compare the performance of other ASR models. All released resources can be found at the ChiMeS portal (https://iclab.ee.ntust.edu.tw/home).
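The CER and KER figures reported above are edit-distance-style metrics. Below is a minimal sketch of character-level CER, plus a simple substring-based KER; the exact KER definition is not spelled out in the abstract, so the `ker` helper here is an assumption, not the thesis' scoring script:

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free on a match)
            prev = cur
    return dp[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def ker(keywords, hyp):
    """Assumed Keyword Error Rate: fraction of reference keywords whose
    exact string is missing from the hypothesis transcript."""
    return sum(kw not in hyp for kw in keywords) / len(keywords)
```

For example, a one-character substitution in a four-character reference gives a CER of 25%; production scorers additionally handle empty references and text normalization.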