
Author: 盧克函 (Ke-Han Lu)
Thesis Title: 上下文知識增強的連結時序分類語音辨識模型 (A Contextual Knowledge-enhanced CTC-based ASR Framework)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 王新民 (Hsin-Min Wang), 王緒翔 (Syu-Siang Wang), 林伯慎 (Bor-Shen Lin)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 84
Chinese Keywords: 自動語音辨識、知識蒸餾、深度學習、連結時序分類、非自迴歸語音辨識器
English Keywords: Automatic speech recognition, Knowledge distillation, Deep learning, Connectionist temporal classification, Non-autoregressive speech recognizer

Automatic speech recognition (ASR) aims to convert speech signals into the corresponding text; to bridge the gap between the speech and text modalities, a model must have a strong ability to learn both acoustic features and the contextual semantics of text. In recent years, non-autoregressive ASR models based on Connectionist Temporal Classification (CTC) have become a very popular research topic because of their fast decoding speed. Meanwhile, large-scale pre-trained models based on self-supervised learning play an important role in many fields, such as natural language processing, computer vision, and speech processing, and many state-of-the-art methods achieve better performance by building on such models.

This thesis introduces important recent developments in end-to-end speech recognition, including several well-known autoregressive and non-autoregressive ASR models, as well as methods that integrate pre-trained language models into ASR systems. To address the problem that the conditional independence assumption of CTC prevents the model from effectively learning contextual information, we propose a novel context-aware knowledge transferring framework that transfers the semantic knowledge learned by a pre-trained language model (e.g., BERT) into a CTC-based model; to achieve better performance, a pre-trained acoustic model (e.g., wav2vec2.0) is also employed in the system. We conduct a series of thorough experiments on the Mandarin AISHELL-1 and English TEDLIUM-2 datasets, and the results show that the proposed framework achieves competitive or better accuracy and recognition efficiency than other state-of-the-art methods. In addition, we evaluate our method on the industrial-scale Mandarin AISHELL-2 dataset and also obtain improved performance. Finally, a series of ablation studies and analyses verify the effectiveness of the proposed method.


Automatic speech recognition (ASR) is the process of accurately converting spoken language into written text, which involves effectively mitigating the discrepancy between the two modalities. This requires models to have a strong understanding of both the acoustic features and the contextual coherence of the text.
Recently, Connectionist Temporal Classification (CTC) has become a popular method for training non-autoregressive end-to-end ASR models because of their fast decoding speed. Additionally, large-scale pre-trained models built with self-supervised learning have played an important role in building state-of-the-art systems across various research fields, such as natural language processing, computer vision, and speech processing.
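To make the CTC objective concrete, the following is a minimal, self-contained sketch (not code from this thesis) of CTC training on frame-level encoder outputs; the encoder choice, dimensions, and dummy batch are illustrative assumptions only.

```python
# Minimal illustrative sketch of CTC-based training (PyTorch); the LSTM encoder,
# model sizes, and the dummy batch below are assumptions for illustration.
import torch
import torch.nn as nn

vocab_size = 30      # assumed output vocabulary, with the CTC blank at index 0
feat_dim = 80        # assumed acoustic feature dimension (e.g., log-Mel filterbanks)

encoder = nn.LSTM(feat_dim, 256, num_layers=2, batch_first=True, bidirectional=True)
classifier = nn.Linear(512, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: 4 utterances of 120 frames, each paired with a 20-token target.
feats = torch.randn(4, 120, feat_dim)
feat_lens = torch.full((4,), 120, dtype=torch.long)
targets = torch.randint(1, vocab_size, (4, 20))
target_lens = torch.full((4,), 20, dtype=torch.long)

hidden, _ = encoder(feats)                      # (B, T, 512)
log_probs = classifier(hidden).log_softmax(-1)  # (B, T, V) frame-level posteriors
# nn.CTCLoss expects (T, B, V); at inference, decoding can be a single greedy
# argmax over frames (collapse repeats, drop blanks), which is why CTC models
# decode in parallel rather than autoregressively.
loss = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
loss.backward()
```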

In this thesis, we introduce recent developments in the field of automatic speech recognition, including several notable autoregressive and non-autoregressive models, as well as approaches that incorporate pre-trained language models into ASR systems. Motivated by previous research, and to mitigate the conditional independence assumption of CTC, which makes it difficult for CTC-based models to capture contextual information, we propose a novel context-aware knowledge transferring framework that transfers contextual knowledge from a pre-trained language model (e.g., BERT) into a CTC-based ASR model. To achieve better performance, a pre-trained acoustic model (e.g., wav2vec2.0) is used to build the ASR system. A series of experiments conducted on the Chinese AISHELL-1 and English TEDLIUM-2 datasets demonstrates comparable or superior performance and efficiency compared with state-of-the-art systems. Additionally, we evaluate our method on the large-scale AISHELL-2 dataset and also obtain improved performance there. Finally, comprehensive ablation studies and analyses are conducted to verify the effectiveness of the proposed method.
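As a rough illustration of the kind of knowledge transfer described above, the sketch below adds a simple distillation term on top of the CTC loss, pulling pooled acoustic representations toward a frozen BERT sentence embedding. The GRU encoder (standing in for wav2vec2.0), the projection layer, mean pooling, MSE objective, and loss weight are all assumptions for illustration; they are not the token-dependent CAKT module proposed in the thesis.

```python
# Rough sketch of combining a CTC loss with a BERT-guided distillation term.
# The GRU encoder stands in for wav2vec2.0, and the precomputed sentence embedding
# stands in for a frozen BERT teacher; none of this is the thesis's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, feat_dim, bert_dim = 5000, 80, 768

acoustic_encoder = nn.GRU(feat_dim, 384, num_layers=2, batch_first=True, bidirectional=True)
ctc_head = nn.Linear(768, vocab_size)
proj_to_bert = nn.Linear(768, bert_dim)   # hypothetical projection into the BERT space
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
kd_weight = 0.3                           # assumed interpolation weight between the losses


def training_step(feats, feat_lens, targets, target_lens, bert_sentence_emb):
    """bert_sentence_emb: (B, bert_dim) utterance-level embedding from a frozen BERT."""
    hidden, _ = acoustic_encoder(feats)                           # (B, T, 768)
    log_probs = ctc_head(hidden).log_softmax(-1)                  # (B, T, V)
    loss_ctc = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
    # Pool the acoustic frames and pull them toward the BERT representation, so the
    # CTC encoder absorbs contextual knowledge during training only.
    pooled = proj_to_bert(hidden.mean(dim=1))                     # (B, bert_dim)
    loss_kd = F.mse_loss(pooled, bert_sentence_emb)
    return loss_ctc + kd_weight * loss_kd
```

Because the teacher is only consulted during training in such a setup, inference remains a single non-autoregressive CTC pass, which is consistent with the efficiency claims in the abstract.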

Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Summary of Contribution
2 Related Work
  2.1 Automatic Speech Recognition
    2.1.1 End-to-end Automatic Speech Recognition
  2.2 The Autoregressive Model
    2.2.1 Speech Transformer
    2.2.2 Hybrid CTC/Attention Encoder-Decoder Architecture
  2.3 The Non-Autoregressive Model
    2.3.1 CTC
    2.3.2 Intermediate CTC
    2.3.3 Mask-CTC
    2.3.4 LASO
  2.4 Integrating Pre-trained Models into ASR
    2.4.1 Pre-trained Language Model
    2.4.2 Pre-trained Acoustic Model
    2.4.3 The Cascade Style Method
    2.4.4 The Knowledge Distillation Style Model
3 Proposed Methods
  3.1 Context-aware Knowledge Transferring Strategy
    3.1.1 CTC-based wav2vec2.0
    3.1.2 Token-dependent Knowledge Transferring Module
    3.1.3 Context-aware Training Strategy
    3.1.4 Training and Inference
4 Experiment
  4.1 Corpus
    4.1.1 Tokenization
  4.2 Experiment Setup
    4.2.1 Pre-trained wav2vec2.0
    4.2.2 The Automatic Speech Recognition System
    4.2.3 Evaluation Metric
  4.3 Experiment Result
    4.3.1 Results on AISHELL-1 Dataset
    4.3.2 Results on AISHELL-2 Dataset
    4.3.3 Results on TEDLIUM-2 Dataset
    4.3.4 Ablation Study on CAKT
    4.3.5 The Impact of Weight Initialization Technique
    4.3.6 The Impact of Contextual BERT Target
    4.3.7 The Impact of Pre-trained wav2vec2.0
5 Discussion
  5.1 Attention Visualization
  5.2 Self-similarity Matrix
6 Conclusion
References

