
Student: 楊志穎 (Chih-Yinh Yang)
Thesis Title: 基於自我與混合注意力之審議語音辨識模型 (Deliberative ASR Modeling based on Self and Mixed Attention)
Advisor: 陳冠宇 (Guan-Yu Chen)
Committee Members: 王新民 (Hsin-Min Wang), 林柏慎 (Bo-Shen Lin)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109 (ROC calendar)
Language: Chinese
Pages: 95
Keywords: neural network, end-to-end speech recognition, deep learning
This thesis introduces a number of well-known end-to-end speech recognition models: the early Connectionist Temporal Classification (CTC) and Recurrent Neural Network Transducer (RNN-T) models; the attention-based models that followed, such as Listen, Attend and Spell (LAS) and the Neural Transducer; and, most recently, the Transformer model built on self-attention, together with the family of end-to-end speech recognition models derived from the Transformer architecture.
This thesis then proposes a Transformer-based end-to-end speech recognition model combining a CTC model, a Transformer model, and a self-and-mixed attention mechanism. It also proposes an error-correction method: using two different decoders, the label sequence decoded by the first model is fed into the second decoder for error correction, yielding a more robust prediction sequence. The proposed model is validated on the Mandarin corpus Aishell-1, where the character error rate (CER) on the test set drops from 6.0% before error correction to 5.8%, giving a more accurate prediction.
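To make the two-pass error-correction idea concrete, the sketch below shows one way a second decoder can attend jointly to the acoustic encoder states and to the first-pass label sequence, in the spirit of the self-and-mixed attention decoder [35] and the deliberation model [34]. It is a minimal, single-layer PyTorch illustration: the class name, layer structure, and dimensions (including the placeholder vocabulary size) are assumptions for demonstration, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class DeliberationDecoderSketch(nn.Module):
    """Hypothetical one-layer sketch of a second-pass (deliberation) decoder."""
    def __init__(self, d_model=256, nhead=4, vocab_size=4233):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.mixed_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tgt_ids, acoustic_states, first_pass_ids):
        # tgt_ids:         (B, T_out) tokens generated so far by the 2nd pass
        # acoustic_states: (B, T_enc, d_model) shared acoustic encoder output
        # first_pass_ids:  (B, T_hyp) label sequence from the 1st decoder
        x = self.embed(tgt_ids)
        # Self-attention over 2nd-pass tokens (causal mask omitted for brevity).
        x, _ = self.self_attn(x, x, x)
        # "Mixed" attention: one memory built from the acoustic states plus the
        # embedded first-pass hypothesis, so first-pass errors can be corrected
        # against the audio evidence.
        memory = torch.cat([acoustic_states, self.embed(first_pass_ids)], dim=1)
        x, _ = self.mixed_attn(x, memory, memory)
        return self.out(x)  # (B, T_out, vocab_size)

# Toy usage with random tensors:
dec = DeliberationDecoderSketch()
logits = dec(torch.randint(0, 4233, (2, 10)),   # partial 2nd-pass output
             torch.randn(2, 50, 256),           # acoustic encoder states
             torch.randint(0, 4233, (2, 12)))   # 1st-pass hypothesis
print(logits.shape)  # torch.Size([2, 10, 4233])
```

The 6.0% and 5.8% figures quoted above are character error rates. For reference, CER is the character-level edit distance between hypothesis and reference divided by the reference length; a minimal implementation of this standard definition:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (substitutions + deletions + insertions) / N."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = minimum edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("今天天氣很好", "今天天器很好"))  # one substitution in six: ~0.167
```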



Table of Contents
Chapter 1 Introduction 1
1.1 Research Background 1
1.1.1 Natural Language Processing 1
1.1.2 Speech Recognition 1
Chapter 2 Related Work 3
2.1 Connectionist Temporal Classification based speech recognition 3
2.2 Recurrent Neural Aligner 13
2.3 RNN-T 14
2.4 Listen, Attend and Spell 19
2.5 Neural Transducer 27
2.6 Hybrid CTC and LAS Model 28
2.7 Monotonic Chunkwise Attention 33
2.8 Transformer 35
2.9 Hybrid CTC and Transformer Model 40
2.10 Continuous Integrate-and-Fire 40
2.11 Two-Pass Model 45
2.12 Two-Pass-Based Deliberation Model 47
2.13 Self-and-Mixed Attention Decoder Model 49
2.14 End-to-End Speech Recognition Toolkits 53
Chapter 3 55
3.1 Self-and-Mixed Attention Mechanism 55
3.2 Deliberation Model 56
3.3 Deliberation Model Based on Self-and-Mixed Attention 57
Chapter 4 Experimental Setup 62
4.1 Experimental Training Procedure 62
4.2 Datasets 63
4.3 Speech Recognition System 63
4.4 Evaluation Metric 63
4.5 Experimental Settings 64
Chapter 5 Experimental Results 66
5.1 Baseline Systems 66
5.2 Results of the Deliberation Model Based on Self-and-Mixed Attention 67
5.3 Study on Hybrid Training 69
5.4 Attention Distributions 70
5.5 Study on Deep Decoding 71
5.6 Comparison with Deeper Baseline Systems 72
5.7 Ablation Study of the Deliberation Model Based on Self-and-Mixed Attention 74
5.8 Comparison of the Deliberation Model Based on Self-and-Mixed Attention with Related Work 75
Chapter 6 Conclusion and Future Work 77
Chapter 7 References 78


    [1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
    [2] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, "Extensions of recurrent neural network language model," in 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2011: IEEE, pp. 5528-5531.
    [3] N. Xue, "Chinese word segmentation as character tagging," in International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, 2003, pp. 29-48.
    [4] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," in Conference on empirical methods in natural language processing, 1996.
    [5] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE international conference on acoustics, speech and signal processing, 2013: IEEE, pp. 6645-6649.
    [6] B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition," Technometrics, vol. 33, no. 3, pp. 251-272, 1991.
    [7] M. Maybury, Advances in automatic text summarization. MIT Press, 1999.
    [8] G. Salton, A. Singhal, M. Mitra, and C. Buckley, "Automatic text structuring and summarization," Information Processing & Management, vol. 33, no. 2, pp. 193-207, 1997.
    [9] R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval. ACM Press, New York, 1999.
    [10] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press, Cambridge, 2008.
    [11] P. F. Brown et al., "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79-85, 1990.
    [12] P. Koehn, Statistical machine translation. Cambridge University Press, 2009.
    [13] L. Hirschman and R. Gaizauskas, "Natural language question answering: the view from here," Natural Language Engineering, vol. 7, no. 4, p. 275, 2001.
    [14] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.
    [15] P. Flandrin, G. Rilling, and P. Goncalves, "Empirical mode decomposition as a filter bank," IEEE Signal Processing Letters, vol. 11, no. 2, pp. 112-114, 2004.
    [16] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Eleventh annual conference of the international speech communication association, 2010.
    [17] G. D. Forney, "The viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
    [18] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369-376.
    [19] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, "First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs," arXiv preprint arXiv:1408.2873, 2014.
    [20] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping," in Interspeech, 2017, vol. 8, pp. 1298-1302.
    [21] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
    [22] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016: IEEE, pp. 4960-4964.
    [23] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
    [24] A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," arXiv preprint arXiv:1704.04368, 2017.
    [25] M. W. Gardner and S. Dorling, "Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences," Atmospheric Environment, vol. 32, no. 14-15, pp. 2627-2636, 1998.
    [26] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, "A neural transducer," arXiv preprint arXiv:1511.04868, 2015.
    [27] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
    [28] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2017: IEEE, pp. 4835-4839.
    [29] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," arXiv preprint arXiv:1712.05382, 2017.
    [30] A. Vaswani et al., "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017.
    [31] L. Dong, S. Xu, and B. Xu, "Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: IEEE, pp. 5884-5888.
    [32] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020: IEEE, pp. 6079-6083.
    [33] T. N. Sainath et al., "Two-pass end-to-end speech recognition," arXiv preprint arXiv:1908.10992, 2019.
    [34] K. Hu, T. N. Sainath, R. Pang, and R. Prabhavalkar, "Deliberation model based two-pass end-to-end speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020: IEEE, pp. 7799-7803.
    [35] X. Zhou, G. Lee, E. Yılmaz, Y. Long, J. Liang, and H. Li, "Self-and-mixed attention decoder with deep acoustic structure for Transformer-based LVCSR," arXiv preprint arXiv:2006.10407, 2020.
    [36] S. Watanabe et al., "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015, 2018.
    [37] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline," in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017: IEEE, pp. 1-5.
    [38] D. Povey et al., "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016, pp. 2751-2755.
    [39] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, "Self-attention transducers for end-to-end speech recognition," arXiv preprint arXiv:1909.13037, 2019.
    [40] J. Salazar, K. Kirchhoff, and Z. Huang, "Self-attention networks for connectionist temporal classification in speech recognition," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: IEEE, pp. 7115-7119.
    [41] S. Karita et al., "A comparative study on transformer vs rnn in speech applications," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019: IEEE, pp. 449-456.
    [42] Z. Tian, J. Yi, J. Tao, Y. Bai, S. Zhang, and Z. Wen, "Spike-triggered non-autoregressive Transformer for end-to-end speech recognition," arXiv preprint arXiv:2005.07903, 2020.

    Full text available from 2024/05/10 (campus network)
    Full text available from 2026/05/10 (off-campus network)
    Full text available from 2026/05/10 (National Central Library: Taiwan NDLTD system)