
Graduate Student: 黃思齊 (Si-Chi Huang)
Thesis Title: 結合連結時序分類與高斯混合模型之英語音素辨識研究
(Combining Connectionist Temporal Classification with Gaussian Mixture Models for English Phone Recognition)
Advisor: 林伯慎 (Bor-Shen Lin)
Committee Members: 楊傳凱 (ckyang@cs.ntust.edu.tw), 羅乃維 (nwlo@cs.ntust.edu.tw)
Degree: Master
Department: School of Management - Department of Information Management
Publication Year: 2019
Graduation Academic Year: 108
Language: Chinese
Number of Pages: 47
Keywords (Chinese): speech recognition, Gaussian mixture model, connectionist temporal classification, long short-term memory units
Keywords (English): ASR, CTC with GMM, Gaussian Mixture Model, Blank Model
Connectionist temporal classification (CTC) is a sequence prediction method that combines dynamic programming with deep learning. Its architecture is similar to the conventional hidden Markov model (HMM), yet it has lower complexity and can achieve better speech recognition performance. Although past research has verified the effectiveness of this method, it has not analyzed in depth why the architecture performs so well. This study investigates that question. First, we pair the CTC recognition framework with Gaussian mixture models (CTC+GMM) and with deep neural networks (CTC+DNN), verifying that CTC can be combined with different types of classifiers. Next, we add an LSTM as a feature extraction layer and find experimentally that it improves recognition performance substantially; moreover, the back end then needs only a simple classifier to reach the same performance. This shows that the feature learning layer plays a critical role in the CTC architecture. In addition, earlier CTC research established that introducing a blank model absorbs frames with high uncertainty, but because the parameters of the individual phone models are coupled inside a neural network classifier, it is difficult to explain in isolation how this absorption works. In the CTC+GMM model, we can observe that the blank model has a far larger variance than any phone model, which explains its strong absorption ability. During training and recognition, the blank model acts like the sea level, the phone models like islands, and the absorbed frames like the seabed or small islands below the surface. Based on this concept, we propose training and recognizing with phone GMMs paired with a simple uniform distribution for the blank. Experiments show that this model not only converges but also reaches the same recognition performance. If the blank model is removed, however, accuracy drops sharply, and the time plots of recognition probabilities show that the phone models fail to learn properly. In summary, the blank model is indispensable for the convergence of CTC, but it can be replaced by a simple uniform distribution.
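To make the emission side of this design concrete, the following is a minimal Python sketch (not code from the thesis) of the setup the abstract describes: each phone modeled by a diagonal-covariance GMM, and the blank modeled by a flat uniform density, the "sea level" of the analogy above. The feature dimension, the feature range, and names such as `phone_gmms` are illustrative assumptions.

```python
import numpy as np

# Sketch of the emission model: per-phone diagonal-covariance GMMs,
# plus a uniform "sea level" density for the blank. Illustrative only.

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, weights, means, vars_):
    """Log density of a GMM, log sum_k w_k N(x; mu_k, var_k), via log-sum-exp."""
    comp = [np.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, vars_)]
    mx = max(comp)
    return mx + np.log(sum(np.exp(c - mx) for c in comp))

D = 39                            # e.g. MFCC dimension (assumption)
feat_lo, feat_hi = -10.0, 10.0    # assumed bounded feature range
# Uniform density over [feat_lo, feat_hi]^D: a constant log value per frame.
log_blank = -D * np.log(feat_hi - feat_lo)

def frame_log_emissions(x, phone_gmms):
    """Per-frame log densities [blank, phone_1, ..., phone_N].
    A frame is 'absorbed' by the blank when every phone density falls
    below the flat blank level, as in the sea-level analogy."""
    return np.array([log_blank] + [log_gmm(x, *g) for g in phone_gmms])
```

Under this view, replacing the blank GMM with a constant-density floor changes nothing structurally: the blank still wins on uncertain frames simply because the phone densities drop below the floor there.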


Connectionist temporal classification (CTC) combines dynamic programming with deep neural networks for sequence prediction tasks such as speech recognition. Its architecture is similar to the conventional hidden Markov model (HMM), but it may achieve better performance at lower complexity. Although its effectiveness has been verified in the past, there has not yet been sufficient analysis to explain why it works so effectively or to justify the design of the network structure. In this research, we investigate why CTC works by combining CTC with GMM and comparing it with CTC with DNN in parallel. In the baseline experiment, both perform slightly worse than a conventional HMM of higher complexity. When a feature extraction layer of LSTM is applied, however, the accuracies improve drastically and exceed those of the HMM. Good performance can even be obtained with a very simple setting, such as a single-mixture Gaussian. This indicates that feature extraction is critical to the strength of CTC. In addition, the blank model in CTC is known to absorb the transitional frames between phonemes, but why and how it achieves this has not been well explained. In CTC with GMM, it can be observed that the blank model has a much larger variance than any phone model. Accordingly, we propose replacing the blank model's GMM with a uniform distribution, and verify that such a model converges and obtains comparable performance.
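For the dynamic-programming side of CTC mentioned above, the sketch below implements the standard log-domain CTC forward recursion over the blank-augmented label sequence (the textbook algorithm of Graves et al., not the thesis's own code). The per-frame scores in `log_probs` are assumed to be normalized log probabilities; with GMM or uniform densities as in the previous sketch, they would first be converted to per-frame posteriors.

```python
import numpy as np

NEG_INF = -np.inf

def ctc_forward_logprob(log_probs, labels, blank=0):
    """CTC forward pass in the log domain.
    log_probs: (T, C) array of per-frame log emission probabilities.
    labels:    target label sequence, e.g. phone indices (no blanks).
    Returns log P(labels | input), summed over all blank-augmented alignments."""
    T = log_probs.shape[0]
    # Interleave blanks: l' = [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    alpha = np.full(S, NEG_INF)
    alpha[0] = log_probs[0, ext[0]]          # start in leading blank...
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]      # ...or in the first label
    for t in range(1, T):
        new = np.full(S, NEG_INF)
        for s in range(S):
            a = alpha[s]                               # stay in state s
            if s >= 1:
                a = np.logaddexp(a, alpha[s - 1])      # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = np.logaddexp(a, alpha[s - 2])      # skip a blank
            new[s] = a + log_probs[t, ext[s]]
        alpha = new
    # Valid alignments end in the last label or the trailing blank.
    return np.logaddexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

The `ext[s] != ext[s - 2]` condition is what forces a blank between repeated labels, which is why removing the blank model cannot be compensated for by the phone models alone.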

Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Main Research Results
  1.3 Thesis Organization
Chapter 2 Literature Review
  2.1 Connectionist Temporal Classification (CTC)
    2.1.1 Basic Definitions
    2.1.2 Derivation of the Training Procedure
  2.2 Gaussian Mixture Model (GMM)
  2.3 Recurrent Neural Network (RNN)
  2.4 Chapter Summary
Chapter 3 Building CTC with Gaussian Mixture Models
  3.1 Model Architecture
  3.2 Corpus
  3.3 Experimental Settings
  3.4 Baseline Experiments
  3.5 Experiments with Feature Extraction
  3.6 Chapter Summary
Chapter 4 Investigation of the Blank Model
  4.1 Experiments on Replacing the Blank Model
  4.2 Experiments With and Without the Blank Model
  4.3 Chapter Summary
Chapter 5 Conclusion
References

