
Graduate Student: Ching-Yang Hsieh (謝謦暘)
Thesis Title: Improvement of Attention Mechanism and Training Schedule of Encoder-Decoder Model for English Phoneme Recognition (編解碼器模型中注意力機制與訓練排程之改進-應用於英語音素辨識)
Advisor: Bor-Shen Lin (林伯慎)
Committee Members: Berlin Chen (陳柏琳), Chuan-Kai Yang (楊傳凱)
Degree: Master
Department: Department of Information Management, School of Management
Year of Publication: 2021
Academic Year of Graduation: 109 (2020-2021)
Language: Chinese
Number of Pages: 50
Keywords (Chinese): English Phoneme Recognition, Encoder-Decoder Model, Attention Mechanism, Scheduled Sampling, Attention Guidance
Keywords (English): Seq2Seq Model, Phoneme Recognition, Guided Attention Learning, Training Schedule
Statistics: 236 views, 10 downloads
Abstract: Early speech recognition systems mostly used hidden Markov models based on Gaussian mixture models (GMM-HMM): the Markov model generates the phoneme state sequence, and the Gaussian mixture model scores how well the speech features match each state. In recent years, sequence-to-sequence models have developed rapidly; among them, the encoder-decoder model has attracted the most attention and has been successfully applied to many sequence prediction problems.

This thesis studies the application of attention-based encoder-decoder models to speech recognition, focusing on how the attention mechanism and the training schedule affect phoneme recognition performance. We evaluated three attention mechanisms and three model training strategies as experimental baselines. The results show that the general attention mechanism gives the most stable recognition performance, and that scheduled sampling effectively improves performance, although the initial self-feeding ratio must be tuned. We further experimented with the input features of the phoneme classification layer and found that using the context vector alone yields the best performance. In addition, we propose adding a "context loss" objective to the attention mechanism, derived from GMM-HMM forced-alignment information, to guide attention learning. Experiments show that this attention guidance improves recognition performance across a variety of network structures, making it a reliable method. Finally, with a bidirectional three-layer LSTM encoder, pre-training with scheduled sampling, attention-guided learning, and context-vector-based phoneme classification, the recognition accuracy on the 39 English phonemes of the TIMIT corpus reaches 77.409%.
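For readers unfamiliar with it, the "general" attention mechanism the abstract identifies as most stable is the bilinear score of Luong et al.: score(h_t, h_s) = h_t^T W h_s, softmax-normalized over encoder time steps, with the context vector computed as the resulting weighted sum of encoder states. Below is a minimal PyTorch sketch of that computation; the class and dimension names are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class GeneralAttention(nn.Module):
    """Luong-style 'general' attention: score(h_t, h_s) = h_t^T W h_s."""

    def __init__(self, dec_dim: int, enc_dim: int):
        super().__init__()
        self.W = nn.Linear(enc_dim, dec_dim, bias=False)  # learned bilinear map W

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state:  (batch, dec_dim)        decoder hidden state h_t
        # enc_states: (batch, T_enc, enc_dim) encoder outputs h_s
        scores = torch.bmm(self.W(enc_states), dec_state.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)  # attention distribution over frames
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights                 # (batch, enc_dim), (batch, T_enc)
```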
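Scheduled sampling mixes teacher forcing with self-feeding during training: at each decoder step the ground-truth previous phoneme is fed with some probability, and the model's own previous prediction otherwise, with the mix shifting toward self-feeding as training proceeds. A minimal sketch follows, assuming an exponential decay schedule; the abstract only says the initial ratio must be tuned, not which schedule the thesis uses.

```python
import random
import torch

def teacher_ratio(epoch: int, start: float = 0.9, decay: float = 0.95) -> float:
    # Illustrative exponential schedule: the probability of feeding the ground
    # truth starts at `start` and decays each epoch, so the decoder gradually
    # trains on its own predictions. Its complement (1 - start) is the initial
    # self-feeding ratio that the abstract says needs tuning.
    return start * decay ** epoch

def next_decoder_input(gold_prev: torch.Tensor, pred_prev: torch.Tensor,
                       ratio: float) -> torch.Tensor:
    # Teacher forcing with probability `ratio`, self-feeding otherwise.
    return gold_prev if random.random() < ratio else pred_prev
```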
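The finding that the phoneme classification layer works best when fed the context vector alone can be pictured with the hypothetical sketch below; the alternative input it mentions in a comment (concatenating the decoder state, presumably among the inputs the thesis compared) and all dimension names are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeClassifier(nn.Module):
    # Output layer over the 39 TIMIT phoneme classes. Feeding it the context
    # vector alone is the input choice the abstract reports as best.
    def __init__(self, enc_dim: int, n_phonemes: int = 39):
        super().__init__()
        self.out = nn.Linear(enc_dim, n_phonemes)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Alternative input: torch.cat([context, dec_state], dim=-1),
        # with self.out sized enc_dim + dec_dim accordingly.
        return self.out(context)  # logits over phoneme classes
```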
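The record does not spell out the exact form of the proposed "context loss". One plausible reading, sketched below under that assumption, is a divergence penalty that pulls each output step's attention distribution toward the encoder-frame span that the GMM-HMM forced alignment assigns to that phoneme; the interpolation weight `lam` is hypothetical.

```python
import torch
import torch.nn.functional as F

def guided_attention_loss(attn: torch.Tensor, align: torch.Tensor) -> torch.Tensor:
    # attn:  (batch, T_dec, T_enc) attention weights from the decoder.
    # align: (batch, T_dec, T_enc) 0/1 float mask from GMM-HMM forced
    #        alignment marking the frames belonging to each output phoneme.
    target = align / align.sum(dim=2, keepdim=True).clamp(min=1.0)  # row-normalize
    return F.kl_div(attn.clamp(min=1e-8).log(), target, reduction="batchmean")

# Combined objective (lam is a hypothetical weight):
# loss = cross_entropy_loss + lam * guided_attention_loss(attn, align)
```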

Chapter 1: Introduction
  1.1 Research Background and Motivation
  1.2 Main Research Results
  1.3 Thesis Organization
Chapter 2: Literature Review
  2.1 Gaussian Mixture Model
  2.2 Hidden Markov Model
  2.3 Recurrent Neural Networks
    2.3.1 Basic Introduction
    2.3.2 Long Short-Term Memory
  2.4 Chapter Summary
Chapter 3: Speech Recognition Based on the Encoder-Decoder Model
  3.1 Introduction
  3.2 Encoder-Decoder Model
    3.2.1 Basic Model Architecture
    3.2.2 Attention Mechanism
    3.2.3 Phoneme Classification Layer
  3.3 Training Methods
    3.3.1 Teacher Forcing
    3.3.2 Scheduled Sampling
    3.3.3 Autoregressive Model
  3.4 Basic Experiments
    3.4.1 Corpus Description
    3.4.2 Experimental Settings
    3.4.3 Baseline Experiments
  3.5 Results with Bidirectional Multi-Layer Recurrent Encoders
  3.6 Chapter Summary
Chapter 4: Improvements to the Encoder-Decoder Model
  4.1 Introduction
  4.2 Comparison of Classification Layer Inputs
  4.3 Attention Guidance
  4.4 Experimental Results
  4.5 Conclusion
Chapter 5: Conclusions and Future Work
References
