
Author: 陳逸勳 (Yi-Xun Chen)
Thesis Title: Speech Recognition for Nursing Shift Handover Context (護理交班情境之語音辨識技術)
Advisor: 鍾聖倫 (Sheng-Luen Chung)
Committee Members: 鍾聖倫 (Sheng-Luen Chung), 蘇順豐 (Shun-Feng Su), 郭重顯 (Chung-Hsien Kuo), 方文賢 (Wen-Hsien Fang), 徐繼聖 (Gee-Sern Hsu)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 55
Chinese Keywords: 特定情境語音辨識 (context-specific speech recognition)
English Keywords: Special context, Speech Recognition
Speech recognition is the technology of automatically transcribing speech into text. Systems such as Google, Siri, and Baidu mainly target speech in general-purpose contexts; when facing a specialized professional context such as nursing shift handover conversations, however, the particular vocabulary and sentence patterns in the speech can drive the error rate of existing systems above 40%. This study targets the specific context of nursing handover speech and develops verbatim recognition for conversational utterances in this professional setting, focusing on the domain's main difficulties: note-style terse sentence patterns, terminological vocabulary, and mixed Chinese and English. To this end, we first built a Mandarin corpus for the nursing handover context: handover conversations were recorded at a nursing station of Taipei Hospital and manually annotated with the corresponding ground-truth text. English words, including medical terms, were labeled with subwords, yielding a Mandarin nursing handover corpus of 2,979 sentences totaling 328 minutes. From this corpus we derived a language model that reflects the contextual dependencies among words in this professional setting, and then, on the backbone of the Deep Speech 2 network, combined several adjustments to raise the recognition rate for this special context: (1) layer normalization in the deep learning architecture, (2) a decoding algorithm that incorporates the language model, and (3) data augmentation, including conventional audio perturbation and augmentation that reinforces professional terminology in the training set. Because building a speech corpus demands considerable labor and time, our self-collected nursing handover corpus is limited in size; three data augmentation techniques were therefore used to increase the amount of training data and alleviate the degraded recognition caused by insufficient data. After training and testing the adjusted architecture, we obtain an error rate of 11.966%, a significant improvement over the 45.903% error rate of the Google system trained for general contexts.
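The error rates above are character error rates (CER): the character-level edit distance between the recognized text and the ground truth, normalized by the reference length. Below is a minimal Python sketch of the metric, assuming a plain Levenshtein-distance implementation; it is illustrative only and not the evaluation code used in the thesis.

```python
# Character Error Rate (CER): Levenshtein distance between the reference and
# hypothesis character sequences, divided by the reference length.
def edit_distance(ref: str, hyp: str) -> int:
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# One substituted character out of six -> CER of about 16.7 %.
print(cer("病人血壓正常", "病人血壓正長"))
```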


Speech recognition refers to the process of automatically transcribing speech into its corresponding text. Despite the availability of solutions such as Google, Siri, and Baidu, recognizing speech that occurs in specialized professional settings remains difficult, primarily because the vocabulary and expression patterns used differ from those of general contexts. This thesis reports the recognition technique we have developed for speech recorded during nursing shift handovers, which is characterized by profuse use of domain terminology and a mixture of Chinese and imperfect English, often in very terse sentence patterns. To model this special context, a Nursing Handover dataset was collected that contains labeled audio recorded at a nursing station of Taipei Hospital, totaling 2,979 sentences and 328 minutes of speech. English words and phrases are labeled with subwords of similar-sounding Chinese characters to circumvent the problem of mixed-language recognition. With the labeled corpus, a Deep Speech-based recognition network is trained end-to-end to capture the underlying sentence and phrase patterns, and a language model that summarizes the contextual properties of the professional speech is derived from the transcripts to model the particular terminology and phrases used. On top of the Deep Speech 2 network, we introduce the following modifications to improve the recognition rate: layer normalization, beam search decoding with the language model, and various data augmentation techniques. In particular, three data augmentation methods are introduced to mitigate the problem of limited corpus data. Overall, the modified network attains a significant improvement in recognition results: the proposed solution achieves a character error rate (CER) of 11.966%, compared with 45.903% for the Google Speech API.
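As an illustration of two of the modifications named above, the following is a minimal PyTorch sketch of a Deep Speech 2-style acoustic model with layer normalization between bidirectional recurrent layers, trained with CTC loss. The layer counts, hidden sizes, 201-bin spectrogram input, and 3,000-class output vocabulary are assumptions made for this sketch, not the configuration used in the thesis.

```python
# Deep Speech 2-style acoustic model sketch: Conv2d front end, layer-normalized
# bidirectional GRU stack, and a CTC output layer.
import torch
import torch.nn as nn

class DS2LikeModel(nn.Module):
    def __init__(self, n_feats=201, hidden=512, n_classes=3000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.ReLU(),
        )
        conv_feats = 32 * ((n_feats + 1) // 2)          # frequency bins after stride-2 conv
        self.rnns = nn.ModuleList(
            [nn.GRU(conv_feats if i == 0 else 2 * hidden, hidden,
                    bidirectional=True, batch_first=True) for i in range(3)]
        )
        # Layer normalization applied between recurrent layers.
        self.norms = nn.ModuleList([nn.LayerNorm(2 * hidden) for _ in range(3)])
        self.fc = nn.Linear(2 * hidden, n_classes)      # output classes include CTC blank

    def forward(self, spec):                            # spec: (batch, 1, n_feats, time)
        x = self.conv(spec)                             # (batch, 32, n_feats', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', features)
        for rnn, norm in zip(self.rnns, self.norms):
            x, _ = rnn(x)
            x = norm(x)
        return self.fc(x).log_softmax(dim=-1)           # (batch, time', classes)

# One CTC training step on dummy data.
model = DS2LikeModel()
spec = torch.randn(2, 1, 201, 100)                      # two short spectrogram clips
log_probs = model(spec).transpose(0, 1)                 # (time, batch, classes) for CTCLoss
targets = torch.randint(1, 3000, (2, 20))               # label indices; 0 is the blank
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```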

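Of the three data augmentation methods, the table of contents below lists SpecAugment. The sketch that follows applies SpecAugment-style frequency and time masking to a spectrogram in NumPy; the mask counts and widths are illustrative assumptions rather than values taken from the thesis.

```python
# SpecAugment-style augmentation: mask random frequency bands and time spans of
# a spectrogram so the model cannot rely on any single band or frame.
import numpy as np

def spec_augment(spec, n_freq_masks=2, max_f=15, n_time_masks=2, max_t=20):
    """spec: (n_freq_bins, n_frames) array; returns a masked copy."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    fill = spec.mean()
    for _ in range(n_freq_masks):
        f = np.random.randint(0, max_f + 1)
        f0 = np.random.randint(0, max(n_freq - f, 1))
        spec[f0:f0 + f, :] = fill                       # frequency mask
    for _ in range(n_time_masks):
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(n_time - t, 1))
        spec[:, t0:t0 + t] = fill                       # time mask
    return spec

augmented = spec_augment(np.random.randn(201, 300))
```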
Abstract (Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1: Introduction
    1.1 Speech Recognition
    1.2 Speech in Medical Contexts
    1.3 Contribution
    1.4 Paper Organization
Chapter 2: Related Work
    2.1 Deep Learning Based Speech Recognition Solutions
    2.2 Language model
    2.3 Data augmentation
Chapter 3: Proposed Solution
    3.1 Overview in a flowchart
    3.2 Pre-process, Deep Speech 2 network, and CTC
        3.2.1 Pre-process
        3.2.2 Deep Speech 2 network
        3.2.3 Connectionist Temporal Classification (CTC)
    3.3 Layer normalization
    3.4 Language model
    3.5 Decoder
        3.5.1 Greedy Decoder
        3.5.2 Beam Search Decoder with Language Model
    3.6 Data augmentation
        3.6.1 Wave augmentation
        3.6.2 SpecAugment
        3.6.3 Cascade augmentation
Chapter 4: Experiment Setups and Results
    4.1 Experiment 1: On fixed 500 Mandarin sentences
        4.1.1 Dataset preparation
        4.1.2 Experiment objectives
        4.1.3 Deep Speech 2 network and setup
        4.1.4 Three experiment settings
    4.2 Experiment 2: On direct read-outs from medical records
        4.2.1 Speech Corpuses
        4.2.2 Ablation test
    4.3 Experiment 3: Cross domain tests on Different Datasets
Chapter 5: Conclusion
References
Appendix A: Glossary
Appendix B: Dictionary of DS2
Appendix C: Dictionary used for keyword cascade augmentation

