
Graduate Student: 林崇恩 (Chong-En Lin)
Thesis Title: 基於語音和文本共享語意空間的高效Transformer架構非自迴歸語音辨識模型 (Exploring Speech and Text Shared Semantic Space for Efficient Non-Autoregressive Transformer-based ASR)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 陳柏琳 (Berlin Chen), 洪志偉 (Jeih-Weih Hung), 曾厚強 (Hou-Chiang Tseng)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: Chinese
Number of Pages: 66
Keywords: Speech Recognition, Semantic Learning, Non-autoregressive Model, Transformer

End-to-end speech recognition models fall into two main categories: autoregressive models and non-autoregressive models. In recent years the performance of non-autoregressive models has surpassed that of autoregressive models, and the most important practical difference between the two is speed: a non-autoregressive model decodes tens to hundreds of times faster than an autoregressive one. Non-autoregressive modeling has therefore become an emerging research topic in speech recognition.

This thesis first reviews the well-known autoregressive and non-autoregressive speech recognition models. Surveying the common non-autoregressive models, we find that they suffer from two problems: a serious lack of semantic information and a very large number of parameters. We therefore propose a novel non-autoregressive end-to-end model that explores a shared semantic space between speech and text for efficient non-autoregressive Transformer-based ASR, which we call SATSS-NAT. Under this architecture the acoustic features can efficiently learn semantic information, and we further propose a scheduled-learning training strategy to maximize the model's ability to learn textual information. Experimental results show that, compared with other non-autoregressive models of similarly small parameter scale, the proposed SATSS-NAT achieves state-of-the-art results on the Mandarin AISHELL-1 corpus. In addition, SATSS-NAT decodes about 58 times faster than an autoregressive Transformer while also reducing the recognition error rate substantially.
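The speed claim above comes down to sequential versus parallel decoding. As a rough illustration only (this is not the SATSS-NAT implementation described in the thesis), the following PyTorch sketch contrasts a greedy autoregressive Transformer decoder, which needs one forward pass per emitted token, with a non-autoregressive decoder that predicts every output position in a single pass; the module sizes, the placeholder-query decoding, and all names are illustrative assumptions.

```python
# Illustrative sketch (untrained, random weights); not the thesis code.
import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_LEN = 1000, 256, 32

torch.manual_seed(0)
encoder_out = torch.randn(1, 100, D_MODEL)  # (batch, acoustic frames, dim), e.g. from a speech encoder
layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
embed = nn.Embedding(VOCAB, D_MODEL)
proj = nn.Linear(D_MODEL, VOCAB)

def autoregressive_decode(sos_id=1, eos_id=2):
    """Greedy AR decoding: one full decoder forward pass per emitted token."""
    ys = torch.tensor([[sos_id]])
    for _ in range(MAX_LEN):
        L = ys.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = decoder(embed(ys), encoder_out, tgt_mask=causal)
        nxt = proj(h[:, -1]).argmax(-1, keepdim=True)  # only the newest position is kept
        ys = torch.cat([ys, nxt], dim=1)
        if nxt.item() == eos_id:
            break
    return ys[:, 1:]

def non_autoregressive_decode(num_tokens=10, mask_id=0):
    """NAR decoding: all output positions are predicted in one forward pass."""
    queries = embed(torch.full((1, num_tokens), mask_id))  # placeholder (mask-like) queries
    h = decoder(queries, encoder_out)                      # no causal mask; single pass
    return proj(h).argmax(-1)

print("AR output:", autoregressive_decode().shape)        # grows one token per pass
print("NAR output:", non_autoregressive_decode().shape)   # (1, num_tokens) from one pass
```

Timing the two functions on long utterances makes the asymptotic difference obvious: the autoregressive loop scales with the output length, whereas the non-autoregressive call does not, which is the source of speedups of the kind reported above (roughly 58x in the thesis).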

Table of Contents
Chapter 1  Introduction  1
  1.1  Speech Recognition  1
Chapter 2  Related Work  3
  2.1  Connectionist Temporal Classification  3
  2.2  Speech-Transformer  6
  2.3  Hybrid CTC/Attention End-to-End ASR  12
  2.4  Mask CTC  14
  2.5  Spike-Triggered Non-Autoregressive Transformer  17
  2.6  Listen Attentively, and Spell Once  20
  2.7  CTC Alignment-based Single Step Non-autoregressive Transformer  24
  2.8  Non-autoregressive ASR With Pre-trained Models  28
Chapter 3  Methodology  29
  3.1  SATSS-NAT  29
    3.1.1  Model Description  30
  3.2  Scheduled Learning  37
  3.3  Training Procedure  38
  3.4  Decoding Procedure  40
Chapter 4  Experiments  41
  4.1  Datasets  41
  4.2  Experimental Setup  42
  4.3  Experimental Results  45
    4.3.1  Results of SATSS-NAT on AISHELL-1  45
    4.3.2  Results of SATSS-NAT on CSJ  47
    4.3.3  Results of SATSS-NAT on TED-LIUM 2  48
    4.3.4  Performance of Different Scheduled-Learning Combinations in SATSS-NAT  48
    4.3.5  Effect of Textual Information on SATSS-NAT  49
    4.3.6  Effect of the Fusion Mask on SATSS-NAT  50
    4.3.7  Effect of Semantic Queries on SATSS-NAT  51
    4.3.8  Effect of the Number of Semantic Queries on SATSS-NAT  51
    4.3.9  Effect of Auxiliary Semantic Features on SATSS-NAT  52
    4.3.10  Effect of the Pre-training Text Corpus on SATSS-NAT  53
    4.3.11  Effect of the Accuracy of the Dynamic Text Extractor's Input Count on SATSS-NAT  54
    4.3.12  Effect of Encoder-Side Performance on SATSS-NAT  57
Chapter 5  Conclusion and Future Work  59
Chapter 6  References  61

Full-text release date: 2025/08/30 (campus network)
Full-text release date: 2025/08/30 (off-campus network)
Full-text release date: 2025/08/30 (National Central Library: Taiwan NDLTD system)