
Student: 余福浩 (Fu-Hao Yu)
Thesis Title: 基於Transformer之非自迴歸端對端語音辨識器 (Non-autoregressive Transformer-based End-to-end ASR using Pre-trained Language Models)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 林伯慎 (Bor-Shen Lin), 陳柏琳 (Ber-Lin Chen), 王新民 (Hsin-Min Wang)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 47
Chinese Keywords: 語音辨識, Transformer, 非自迴歸, 預訓練語言模型
English Keywords: automatic speech recognition, Transformer, non-autoregressive, pre-trained language model

Transformer-based models have brought significant innovations to major research areas such as speech signal processing, natural language processing, and computer vision. Since the Transformer was proposed, attention-based end-to-end automatic speech recognition models have become the mainstream architecture in recent years. Moreover, compared with conventional autoregressive models, non-autoregressive ASR architectures deliver comparable performance with faster decoding, making them one of the emerging research directions. In natural language processing, the BERT model has attracted great attention because it encodes rich contextual semantic information and can achieve excellent performance on a variety of downstream tasks with only simple fine-tuning.
This thesis introduces several well-known autoregressive and non-autoregressive end-to-end speech recognition models. To inherit the advantages of non-autoregressive ASR while also benefiting from pre-trained language models such as BERT, we propose two non-autoregressive end-to-end ASR models built on a pre-trained language model. A series of experiments on the Mandarin speech corpus AISHELL-1 shows that the proposed models achieve competitive or better results than state-of-the-art ASR systems. We also conduct comparative experiments under different settings to analyze the performance of the proposed models. Finally, experiments on the larger Mandarin corpus AISHELL-2 likewise reach state-of-the-art results.


Transformer-based models have led to significant innovations in various classic and practical subjects, including speech processing, natural language processing, and computer vision. Building on the Transformer, attention-based end-to-end automatic speech recognition (ASR) models have become a popular choice in recent years. Specifically, non-autoregressive modeling, which achieves fast inference speed with performance comparable to conventional autoregressive methods, is an emergent research topic. In natural language processing, the bidirectional encoder representations from Transformers (BERT) model has received widespread attention, partly due to its ability to infer contextualized word representations and to obtain superior performance on downstream tasks with only simple fine-tuning.
This thesis introduces several well-known end-to-end speech recognition models, covering both autoregressive and non-autoregressive approaches. To inherit the advantages of non-autoregressive ASR modeling while also benefiting from a pre-trained language model (e.g., BERT), two non-autoregressive Transformer-based end-to-end ASR models built on BERT are presented herein. A series of experiments on the AISHELL-1 dataset demonstrates that the proposed models achieve competitive or superior results compared to state-of-the-art ASR systems. We also conduct a series of comparative experiments with different settings to analyze the performance of the proposed models. Finally, the model is evaluated on a larger Mandarin dataset, AISHELL-2, where it also reaches a competitive result.
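To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a non-autoregressive Transformer ASR model of the general kind described above: an acoustic Transformer encoder produces frame-level representations, and a bidirectional (non-causal) decoder predicts all output tokens in one parallel pass instead of one token at a time. The class name, dimensions, the fixed 64-token output length, and the learned-query decoder input are illustrative assumptions, not the exact NAR-BERT-ASR recipe (which additionally incorporates a pre-trained BERT model).

# Hypothetical sketch of non-autoregressive Transformer-based ASR (not the thesis's exact method).
import torch
import torch.nn as nn

class NARTransformerASR(nn.Module):
    def __init__(self, vocab_size=4233, d_model=256, nhead=4,
                 num_encoder_layers=6, num_decoder_layers=6, feat_dim=80):
        super().__init__()
        # Project acoustic features (e.g., log-Mel filterbanks) into the model dimension.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(enc_layer, num_encoder_layers)
        # Learned queries stand in for the decoder input; real systems may derive it
        # from CTC alignments or attention over the encoder states instead.
        self.token_queries = nn.Parameter(torch.randn(1, 64, d_model))  # assumed max 64 output tokens
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # No causal mask: every output position attends to all others, which is what
        # makes the prediction non-autoregressive.
        self.decoder = nn.TransformerDecoder(dec_layer, num_decoder_layers)
        self.classifier = nn.Linear(d_model, vocab_size)  # per-position token logits

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) acoustic features
        memory = self.acoustic_encoder(self.feat_proj(feats))
        queries = self.token_queries.expand(feats.size(0), -1, -1)
        # All token positions are decoded in a single parallel pass, with no
        # step-by-step feedback of previously predicted tokens.
        hidden = self.decoder(queries, memory)
        return self.classifier(hidden)  # (batch, 64, vocab_size)

if __name__ == "__main__":
    model = NARTransformerASR()
    dummy_feats = torch.randn(2, 300, 80)  # 2 utterances, 300 frames each
    print(model(dummy_feats).shape)        # torch.Size([2, 64, 4233])

Because every token is emitted in one forward pass, decoding latency is roughly constant in the output length, which is the speed advantage over autoregressive decoders noted in the abstract.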

1. Introduction
2. Related Work
  2.1. Autoregressive and Non-autoregressive Models
  2.2. The AR Models
    2.2.1. Listen Attend and Spell (LAS)
    2.2.2. Speech Transformer
    2.2.3. Hybrid CTC/Attention Architecture Model
    2.2.4. BERT-ASR
  2.3. The NAR Models
    2.3.1. Connectionist Temporal Classification (CTC)
    2.3.2. Mask CTC
    2.3.3. Listen Attentively and Spell Once (LASO)
3. Proposed Methods
  3.1. NAR-BERT-ASR
    3.1.1. The Training Recipe and Settings
  3.2. Residual NAR-BERT-ASR
4. Experiments
  4.1. Experiment Dataset
  4.2. Experimental Setup
  4.3. Experimental Results
    4.3.1. NAR-BERT-ASR Experimental Results on AISHELL-1
    4.3.2. Ablation Studies of NAR-BERT-ASR
    4.3.3. NAR-BERT-ASR Rescoring Experimental Results
    4.3.4. The Impact of Different Model Dimensions of NAR-BERT-ASR
    4.3.5. The Impact of Different Language Models of NAR-BERT-ASR
    4.3.6. NAR-BERT-ASR Experimental Results on AISHELL-2
    4.3.7. The Improvement of Residual NAR-BERT-ASR
    4.3.8. Ablation Studies of Residual NAR-BERT-ASR
    4.3.9. The Error Analysis
5. Conclusion
6. References


Full-text availability: 2024/10/04 (campus network)
Full-text availability: 2026/10/04 (off-campus network)
Full-text availability: 2026/10/04 (National Central Library: Taiwan thesis and dissertation system)