
Graduate Student: 李憶萱 (Yi-Shiuan Li)
Thesis Title: 中文醫學語音辨識: 語音資料庫與自動辨識技術 (Chinese Medical Speech Recognition: Speech Corpus and Automatic Speech Recognition Technique)
Advisor: 鍾聖倫 (Sheng-Luen Chung)
Committee Members: 鍾聖倫 (Sheng-Luen Chung), 方文賢 (Wen-Hsien Fang), 陳柏琳 (Berlin Chen), 廖元甫 (Yuan-Fu Liao), 丁賢偉 (Hsien-Wei Ting)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 96
Keywords: Deep learning, Chinese medical speech corpus, Speech recognition

For Chinese medical speech recognition technology, this thesis conducts research from a
data-centric perspective, following the Machine Learning Operations (MLOps) workflow of
development and deployment. First is data cleansing: previously mislabeled transcripts in
ChiMeS-14 are corrected, and utterances are re-segmented by semantic completeness to form
sChiMeS-14. Second is optimization of the speech recognition model: with the Joint
CTC/Attention ASR network architecture fixed, and facing the challenge of an extremely
limited corpus, techniques from the literature, including waveform and spectrogram data
augmentation as well as the incorporation of language models, are exhaustively examined
for their effect on recognition performance. Last is mitigation of concept drift: when the
medical department to be recognized after deployment differs from the departments sampled
in the original training set, differences in specialized terminology give rise to the
Out-of-Keyword (OOK) problem. This study proposes an end-to-end keyword augmentation
method that alleviates the OOK problem without re-recording complete medical-record speech
for the new department. Overall, to promote the development of Chinese medical speech
recognition, this thesis makes three concrete contributions: (1) the sChiMeS corpus, with
14.4 hours of speech in 7,225 utterances; (2) a trained Joint CTC/Attention ASR model,
whose Character Error Rate (CER) and Keyword Error Rate (KER) on the sChiMeS-14 test set
are 12.85% and 17.62%, respectively; and (3) a test platform for evaluating the
performance of other ASR models. For details, see the ChiMeS portal
(https://iclab.ee.ntust.edu.tw/home).
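The Joint CTC/Attention training mentioned above is conventionally cast as multitask
learning over a shared encoder. As a minimal sketch, assuming the standard interpolation
form (the weight λ is a hyperparameter whose value is not stated in this abstract):

$$
\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\,\mathcal{L}_{\mathrm{CTC}} \;+\; (1-\lambda)\,\mathcal{L}_{\mathrm{Attention}}, \qquad 0 \le \lambda \le 1,
$$

where $\mathcal{L}_{\mathrm{CTC}}$ is the loss of the CTC decoder and
$\mathcal{L}_{\mathrm{Attention}}$ is the cross-entropy loss of the attention decoder.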


Concerning Chinese medical speech recognition technology, this study readdresses
previously encountered issues in accordance with the process of Machine Learning
Engineering for Production (MLOps) from a data-centric perspective. First is data
cleansing of the speech corpus ChiMeS (Chinese Medical Speech): with mislabels corrected
and speech utterances re-segmented into complete sentences, the newly released sChiMeS-14
contains readouts of 516 in-patient records by 15 professional nurses. Second is
optimization of the speech recognition model: with the Joint CTC/Attention model as the
ASR baseline, auxiliary measures of data augmentation and language models are examined to
verify their efficacy in boosting recognition performance on a very limited speech corpus.
Third and last is the concept drift problem: when the trained model is deployed in a
setting different from the one where the training data was collected, an Out-of-Keyword
(OOK) problem may emerge. An end-to-end keyword augmentation method is proposed to
alleviate this problem without resorting to a significant amount of new recording.
Overall, to facilitate the development of Chinese medical speech recognition, this thesis
contributes: (1) the sChiMeS corpus, the first Chinese medical speech corpus of its kind,
totaling 14.4 hours and 7,225 sentences; (2) a trained Joint CTC/Attention ASR model,
yielding a Character Error Rate (CER) of 12.85% and a Keyword Error Rate (KER) of 17.62%
on the sChiMeS-14 test set; and (3) an evaluation platform for soliciting competition and
comparing the performance of other ASR models. All released resources can be found at the
ChiMeS portal (https://iclab.ee.ntust.edu.tw/home).
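Since CER and KER are the headline metrics reported above, a minimal scoring sketch
follows. It assumes the standard definition of CER as character-level edit distance
normalized by reference length; the KER shown is only one illustrative keyword-coverage
reading (the thesis's own KER definition is treated in Section 5.1.3 of the outline
below), and all helper names and example strings are hypothetical.

# Minimal sketch of character-level scoring; CER follows the standard definition
# edit_distance(reference, hypothesis) / len(reference). The ker() helper is only
# an illustrative reading of a keyword error rate, not the thesis's exact metric.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def ker(ref_keywords: list[str], hyp: str) -> float:
    """Illustrative keyword error rate: share of reference keywords missing from the hypothesis."""
    missed = sum(1 for kw in ref_keywords if kw not in hyp)
    return missed / max(len(ref_keywords), 1)

if __name__ == "__main__":
    reference = "病人主訴腹痛三天"    # hypothetical reference transcript
    hypothesis = "病人主訴腹部痛三天"  # hypothetical ASR output
    print(f"CER = {cer(reference, hypothesis):.2%}")
    print(f"KER = {ker(['腹痛'], hypothesis):.2%}")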

Abstract (Chinese); Abstract (English); Acknowledgments; Table of Contents; List of Figures; List of Tables
Chapter 1: Introduction
  1.1 Research Background and Motivation
  1.2 Difficulties and Challenges of Medical Speech Recognition
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2: Literature Review
  2.1 Speech Corpora
  2.2 Evolution of ASR
  2.3 Methods for Improving Recognition Performance
    2.3.1 Transfer Learning
    2.3.2 Data Augmentation
    2.3.3 Language Models
  2.4 Methods for Handling OOV and Rare Words
Chapter 3: The ChiMeS Corpus and Portal
  3.1 Collection and Annotation
  3.2 Train/Test Distribution and Evaluation
  3.3 Released Resources
Chapter 4: Experimental Methods
  4.1 Overview of the Training and Testing Pipeline
  4.2 Joint CTC/Attention Architecture
    4.2.1 Shared Encoder
    4.2.2 CTC Decoder
    4.2.3 Attention Decoder
    4.2.4 Multitask Learning (MTL)
  4.3 Decoding with Language Models
    4.3.1 RNN-LM and N-gram LM
    4.3.2 Joint Decoding
  4.4 Data Augmentation
    4.4.1 Spectrogram Augmentation
    4.4.2 Waveform Augmentation
    4.4.3 Keyword Augmentation
  4.5 Transfer Learning
Chapter 5: Experimental Results
  5.1 Experiment 1: Corpus Results Before and After Cleansing
    5.1.1 Two Corpora: ChiMeS-14 and sChiMeS-14
    5.1.2 Results of Training on Each Corpus Separately
    5.1.3 Results Under Different KER Definitions
  5.2 Experiment 2: Comparison of Methods for Improving ASR Results
    5.2.1 With and Without Pre-training
    5.2.2 Ablation Study
    5.2.3 Comparison with Similar Architectures
    5.2.4 Comparison on Mandarin-English Code-Switching Corpora
  5.3 Experiment 3: Cross-Domain Testing
    5.3.1 Effect of Keyword Augmentation
    5.3.2 Comparison of Methods for Handling OOV
Chapter 6: Comparison with the Literature and Conclusion
  6.1 Experimental Comparison
    6.1.1 Comparison with Similar Architectures
    6.1.2 Comparison with Chinese and English Corpora
    6.1.3 Improvement from Handling OOV or OOK
    6.1.4 Comparison with Medical Corpora
  6.2 Conclusion
References
Appendix A: Chinese-English Glossary
Appendix B: sChiMeS-14 Dictionary (Chinese Characters, English Words)
Appendix C: Keyword List
Committee Comments and Responses
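Sections 4.4.1 and 4.4.2 of the outline above list spectrogram and waveform augmentation
among the techniques examined for the limited corpus. The following is a minimal sketch of
SpecAugment-style frequency and time masking on a log-Mel spectrogram; the mask widths,
mask count, and the spec_augment helper are illustrative assumptions, not the thesis's
actual configuration.

# Minimal sketch of SpecAugment-style masking: zero out random frequency bands
# and time spans of a (freq, time) spectrogram. Parameters are illustrative.
import numpy as np

def spec_augment(spec: np.ndarray,
                 freq_mask_width: int = 10,
                 time_mask_width: int = 30,
                 num_masks: int = 2) -> np.ndarray:
    """Apply random frequency and time masks to a copy of the input spectrogram."""
    aug = spec.copy()
    n_freq, n_time = aug.shape
    rng = np.random.default_rng()
    for _ in range(num_masks):
        f = rng.integers(0, freq_mask_width + 1)   # height of the frequency band
        f0 = rng.integers(0, max(n_freq - f, 1))   # start of the frequency band
        aug[f0:f0 + f, :] = 0.0
        t = rng.integers(0, time_mask_width + 1)   # length of the time span
        t0 = rng.integers(0, max(n_time - t, 1))   # start of the time span
        aug[:, t0:t0 + t] = 0.0
    return aug

# Example: augment a hypothetical 80-band log-Mel spectrogram of 500 frames.
mel = np.random.randn(80, 500)
augmented = spec_augment(mel)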


Full-text release date: 2024/08/12 (campus network)
Full-text release date: 2026/08/12 (off-campus network)
Full-text release date: 2026/08/12 (National Central Library: Taiwan NDLTD system)