
Graduate Student: Chin-Hung Kuo (郭勁宏)
Thesis Title: A Speech Recognition Post-correction Framework Combining Reranking and Error Correction (結合重新排序與錯誤修正的語音辨識後修正框架)
Advisor: Kuan-Yu Chen (陳冠宇)
Committee Members: Berlin Chen (陳柏琳), Hou-Chiang Tseng (曾厚強)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 112 (ROC calendar)
Language: English
Pages: 61
Keywords (Chinese): 自動語音辨識, 重新排序, 錯誤修正
Keywords (English): Automatic speech recognition, Reranking, Error correction

In recent years, automatic speech recognition (ASR) systems have been widely deployed in all kinds of computer systems as an important communication channel between humans and machines. To improve ASR accuracy, two popular families of post-processing methods operate on ASR recognition results: N-best reranking and error correction. Considering the distinct characteristics of the two methods and their potential complementarity, we propose a framework named CREAM that combines them in pursuit of better performance. The framework consists of a text correction module built from error correction models, a text rescoring module built from language models, and a text-speech matching module that compares the similarity between text and speech from a pronunciation perspective. In operation, the text correction module first augments the original ASR N-best list, and the other two modules then jointly rerank the expanded list. For experimental fairness, we train and evaluate the framework on a publicly available dataset of ASR recognition results (HypR), which covers recognition results from both Chinese and English speech corpora. The experiments confirm the feasibility of the proposed method, and concrete case studies clearly illustrate the value and contribution of each module.


In recent years, Automatic Speech Recognition (ASR) systems have become integral components of various computer systems, facilitating communication between humans and machines. To enhance ASR accuracy, two popular post-processing methods, N-best reranking and error correction, have emerged, both of which refine the recognition output of ASR. Recognizing the distinct characteristics and potential complementarity of these two methods, we propose a comprehensive framework called CREAM to achieve superior performance. CREAM comprises a text correction module built on error correction models, a text rescoring module that utilizes language models, and a text-speech matching module that assesses the similarity between text and speech from a pronunciation perspective. In operation, the text correction module first expands the original ASR N-best list, and the other two modules then jointly rerank the augmented list. To ensure experimental fairness, we trained and tested the framework on HypR, a publicly available dataset of ASR recognition results that covers both Chinese and English speech corpora. Our experiments validated the feasibility of the proposed approach and clarified the value and contribution of each module through case studies.
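To make the described workflow concrete, the following is a minimal, self-contained sketch of a CREAM-style post-correction pipeline. The module interfaces (correct_fn, rescore_fn, match_fn), the toy scorers, and the interpolation weights lambda_lm and lambda_match are illustrative assumptions for exposition only; they are not the actual models or hyperparameters used in the thesis.

```python
# A minimal sketch of the CREAM-style flow: a correction model expands the
# N-best list, then two reranking modules jointly score the augmented list.
# All callables and weights below are hypothetical stand-ins, not thesis code.
from typing import Callable, List

def cream_rerank(
    nbest: List[str],
    correct_fn: Callable[[List[str]], List[str]],  # text correction module
    rescore_fn: Callable[[str], float],            # text rescoring module (LM score)
    match_fn: Callable[[str], float],              # text-speech matching module
    lambda_lm: float = 0.5,
    lambda_match: float = 0.5,
) -> str:
    """Expand the ASR N-best list with corrected hypotheses, then return the
    hypothesis with the highest weighted combination of module scores."""
    # 1) Text correction module augments the original N-best list (dedupe, keep order).
    candidates = list(dict.fromkeys(nbest + correct_fn(nbest)))
    # 2) The two reranking modules jointly score every candidate.
    def combined(hyp: str) -> float:
        return lambda_lm * rescore_fn(hyp) + lambda_match * match_fn(hyp)
    return max(candidates, key=combined)

# Toy usage with stand-in scorers (the real modules would be neural models).
if __name__ == "__main__":
    nbest = ["i red a book", "i read a book", "eye read a book"]
    best = cream_rerank(
        nbest,
        correct_fn=lambda hyps: [hyps[0].replace("red", "read")],
        rescore_fn=lambda h: 1.0 if "read" in h else 0.0,
        match_fn=lambda h: 1.0 if h.startswith("i ") else 0.0,
    )
    print(best)  # -> "i read a book"
```

In the actual framework, rescore_fn would correspond to a language-model score and match_fn to a model comparing the hypothesis text against the speech at the pronunciation level; Section 4.3.5 of the thesis examines how the weights of the reranking modules affect performance.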

Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
  2.1 Transformer
  2.2 Automatic Speech Recognition
    2.2.1 CTC-based ASR
    2.2.2 Attention-based ASR
    2.2.3 Hybrid CTC/Attention ASR
  2.3 Reranking
    2.3.1 Language Model For Reranking
    2.3.2 Discriminative Model For Reranking
    2.3.3 Large Language Model For Reranking
  2.4 Error Correction
    2.4.1 Early Research
    2.4.2 Current Error Correction Models
3 Proposed Methods
  3.1 CREAM
    3.1.1 Text Correction Module
    3.1.2 Text Rescoring Module
    3.1.3 Text-speech Matching Module
    3.1.4 Multi-Modal Rescoring Model
    3.1.5 Knowledge Distillation For Text-speech Matching
4 Experiments
  4.1 Benchmark and Corpora
  4.2 Experiment Setup
    4.2.1 Text Correction Module
    4.2.2 Text Rescoring Module
    4.2.3 Text-speech Matching Module
    4.2.4 Multi-Modal Rescoring Model
    4.2.5 Evaluation
  4.3 Experiment Result
    4.3.1 Results on AISHELL-1
    4.3.2 Results on TEDLIUM-2 and LibriSpeech
    4.3.3 Ablation Study
    4.3.4 Case Study
    4.3.5 The Impact of The Weights of Reranking Modules
5 Conclusions
References

[1] V. Kepuska and G. Bohouta, “Next-generation of virtual personal assistants (microsoft cortana, apple siri, amazon alexa and google home),” in 2018 IEEE 8th annual computing and communication workshop and conference (CCWC), pp. 99–103, IEEE, 2018.
[2] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, pp. 28492–28518, PMLR, 2023.
[3] D. Rekesh, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, A. Kumar, and B. Ginsburg, “Fast conformer with linearly scalable attention for efficient speech recognition,” arXiv preprint arXiv:2305.05084, 2023.
[4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.
[5] J. Twiefel, T. Baumann, S. Heinrich, and S. Wermter, “Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28, 2014.
[6] J. Du, S. Pu, Q. Dong, C. Jin, X. Qi, D. Gu, R. Wu, and H. Zhou, “Cross-modal asr post-processing system for error correction and utterance rejection,” arXiv preprint arXiv:2201.03313, 2022.
[7] J. Shin, Y. Lee, and K. Jung, “Effective sentence scoring method using bert for speech recognition,” in Asian Conference on Machine Learning, pp. 1081–1093, PMLR, 2019.
[8] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked language model scoring,” arXiv preprint arXiv:1910.14659, 2019.
[9] T. Udagawa, M. Suzuki, G. Kurata, N. Itoh, and G. Saon, “Effect and analysis of large-scale language model rescoring on competitive asr systems,” arXiv preprint arXiv:2204.00212, 2022.
[10] J. Cai, M. Sunkara, X. Li, A. Bhatia, X. Pan, and S. Bodapati, “Masked audio text encoders are effective multi-modal rescorers,” arXiv preprint arXiv:2305.07677, 2023.
[11] S.-H. Chiu and B. Chen, “Innovative bert-based reranking language models for speech recognition,” in 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 266–271, IEEE, 2021.
[12] D. Fohr and I. Illina, “Bert-based semantic model for rescoring n-best speech recognition list,” in INTERSPEECH 2021, 2021.
[13] L. Xu, Y. Gu, J. Kolehmainen, H. Khan, A. Gandhe, A. Rastrow, A. Stolcke, and I. Bulyko, “Rescorebert: Discriminative speech recognition rescoring with bert,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6117–6121, IEEE, 2022.
[14] P. G. Shivakumar, J. Kolehmainen, Y. Gu, A. Gandhe, A. Rastrow, and I. Bulyko, “Discriminative speech recognition rescoring with pre-trained language models,” arXiv preprint arXiv:2310.06248, 2023.
[15] R. Errattahi, A. El Hannani, and H. Ouahmane, “Automatic speech recognition errors detection and correction: A review,” Procedia Computer Science, vol. 128, pp. 32–37, 2018.
[16] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, and F. Metze, “Asr error correction and domain adaptation using machine translation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6344–6348, IEEE, 2020.
[17] L. Zhu, W. Liu, L. Liu, and E. Lin, “Improving asr error correction using n-best hypotheses,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 83–89, IEEE, 2021.
[18] Y. Zhao, X. Yang, J. Wang, Y. Gao, C. Yan, and Y. Zhou, “Bart based semantic correction for mandarin automatic speech recognition system,” arXiv preprint arXiv:2104.05507, 2021.
[19] Y. Leng, X. Tan, L. Zhu, J. Xu, R. Luo, L. Liu, T. Qin, X. Li, E. Lin, and T.-Y. Liu, “Fastcorrect: Fast error correction with edit alignment for automatic speech recognition,” Advances in Neural Information Processing Systems, vol. 34, pp. 21708–21719, 2021.
[20] Y. Leng, X. Tan, R. Wang, L. Zhu, J. Xu, W. Liu, L. Liu, T. Qin, X.-Y. Li, E. Lin, et al., “Fastcorrect 2: Fast error correction on multiple candidates for automatic speech recognition,” arXiv preprint arXiv:2109.14420, 2021.
[21] S. Dutta, S. Jain, A. Maheshwari, S. Pal, G. Ramakrishnan, and P. Jyothi, “Error correction in asr using sequence-to-sequence models,” arXiv preprint arXiv:2202.01157, 2022.
[22] R. Ma, M. J. Gales, K. M. Knill, and M. Qian, “N-best t5: Robust asr error correction using multiple input hypotheses and constrained decoding space,” arXiv preprint arXiv:2303.00456, 2023.
[23] Y. Leng, X. Tan, W. Liu, K. Song, R. Wang, X.-Y. Li, T. Qin, E. Lin, and T.-Y. Liu, “Softcorrect: Error correction with soft detection for automatic speech recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13034–13042, 2023.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[26] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[28] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-end speech recognition: A survey,” arXiv preprint arXiv:2303.03329, 2023.
[29] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, pp. 369–376, 2006.
[30] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[31] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Interspeech, vol. 8, pp. 1298–1302, 2017.
[32] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” Advances in neural information processing systems, vol. 28, 2015.
[33] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4960–4964, IEEE, 2016.
[34] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5884–5888, IEEE, 2018.
[35] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[36] L. Dong, C. Yi, J. Wang, S. Zhou, S. Xu, X. Jia, and B. Xu, “A comparison of label-synchronous and frame-synchronous end-to-end models for speech recognition,” arXiv preprint arXiv:2005.10113, 2020.
[37] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999.
[38] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[40] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[41] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[42] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[43] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[44] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, et al., “Bloom: A 176b-parameter open-access multilingual language model,” arXiv preprint arXiv:2211.05100, 2022.
[45] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[46] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[47] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022.
[48] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform asr error correction?,” arXiv preprint arXiv:2307.04172, 2023.
[49] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E. S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” arXiv preprint arXiv:2309.15701, 2023.
[50] Y.-W. Wang, K.-H. Lu, and K.-Y. Chen, “Hypr: A comprehensive study for asr hypothesis revising with a reference corpus,” arXiv preprint arXiv:2309.09838, 2023.
[51] T. Kemp, T. Schaaf, et al., “Estimating confidence using word lattices,” in EuroSpeech, pp. 827–830, Citeseer, 1997.
[52] A. Allauzen, “Error detection in confusion network,” in Eighth Annual Conference of the International Speech Communication Association, 2007.
[53] F. Wessel, R. Schluter, K. Macherey, and H. Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Transactions on speech and audio processing, vol. 9, no. 3, pp. 288–298, 2001.
[54] H. Jiang, “Confidence measures for speech recognition: A survey,” Speech communication, vol. 45, no. 4, pp. 455–470, 2005.
[55] L. Zhou, Y. Shi, J. Feng, and A. Sears, “Data mining for detecting errors in dictation speech recognition,” IEEE transactions on speech and audio processing, vol. 13, no. 5, pp. 681–688, 2005.
[56] T. Pellegrini and I. Trancoso, “Improving asr error detection with non-decoder based features,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[57] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[58] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
[59] Z. Min and J. Wang, “Exploring the integration of large language models into automatic speech recognition systems: An empirical study,” in International Conference on Neural Information Processing, pp. 69–84, Springer, 2023.
[60] J. Pu, T.-S. Nguyen, and S. Stüker, “Multi-stage large language model correction for speech recognition,” arXiv preprint arXiv:2310.11532, 2023.
[61] C. Chen, Y. Hu, C.-H. H. Yang, H. Liu, S. M. Siniscalchi, and E. S. Chng, “Generative error correction for code-switching speech recognition using large language models,” arXiv preprint arXiv:2310.13013, 2023.
[62] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[63] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[64] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), pp. 1–5, IEEE, 2017.
[65] A. Rousseau, P. Deléglise, Y. Esteve, et al., “Enhancing the ted-lium corpus with selected data for language modeling and more ted talks.,” in LREC, pp. 3935–3939, 2014.
[66] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210, IEEE, 2015.
[67] Y. Lee, S. Shon, and T. Kim, “Learning pronunciation from a foreign language in speech synthesis networks,” arXiv preprint arXiv:1811.09364, 2018.
[68] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[69] B. Chen, G. Xu, X. Wang, P. Xie, M. Zhang, and F. Huang, “Aishell-ner: Named entity recognition from chinese speech,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8352–8356, IEEE, 2022.

Full-text release date: 2029/02/02 (off-campus network)
Full-text release date: 2029/02/02 (National Central Library: Taiwan thesis and dissertation system)