
Graduate Student: Chuan-Sheng Huang (黃詮盛)
Thesis Title: A Study of Heterogeneous CTC Network for Speech Recognition (多層異質CTC神經網路架構於語音辨識之研究)
Advisor: Bor-Shen Lin (林伯慎)
Committee Members: Chuan-Kai Yang (楊傳凱), Ber-Lin Chen (陳柏琳)
Degree: Master
Department: Department of Information Management, College of Management
Publication Year: 2020
Academic Year of Graduation: 108
Language: Chinese
Number of Pages: 64
Chinese Keywords: 語音辨識 (speech recognition), 連結時序分類 (connectionist temporal classification), 卷積神經網路 (convolutional neural network), 多層異質神經網路 (multilayer heterogeneous neural network), 音素分類 (phoneme classification)
Foreign Keywords: ASR, CTC, CLDNN, LCDNN, phoneme classification
    Connectionist Temporal Classification (CTC) is a sequence prediction method for deep learning whose dynamic-programming formulation resembles that of Hidden Markov Models (HMMs); in speech recognition, CTC achieves lower complexity and better performance than HMMs. Previous work has shown that CTC outperforms HMMs, but has not examined CTC's learning process or the distribution of its errors. In this thesis, we observe the sequence predictions of a basic Deep Neural Network (DNN) trained with CTC, analyze phoneme accuracy both globally and at specific positions, and study the relationship between phoneme substitution errors and the number of training samples, in order to understand the learning process of CTC. In addition, the CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network), which combines a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and a DNN, has been applied to speech recognition, but its sequence decoding uses HMMs rather than CTC. This thesis applies heterogeneous network architectures with CTC decoding and proposes the LCDNN (Long Short-Term Memory, Convolutional, Deep Neural Network), which moves the LSTM to the input layer. This prevents the LSTM from losing information, and thereby failing to extract discriminative features, when max pooling in a preceding CNN reduces the sampling rate. Experimental results show that the LCDNN outperforms the CLDNN. We also propose combining 3-D speech features with a flattened 3-D CNN (3x3x3 kernels) and compare it with a CNN using 2-D speech features; both the 3-D speech features and the flattened 3-D CNN outperform the 2-D feature CNN. Finally, an LCDNN with two LSTM layers and a flattened 3-D CNN achieves an average phoneme accuracy of 75.42% on the TIMIT English corpus.
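As a minimal illustration of the CTC output convention discussed in the abstract (not code from the thesis): a frame-level label sequence is collapsed by first merging adjacent repeated labels and then removing blanks, which is how CTC maps per-frame predictions to a phoneme sequence. The phoneme symbols below are hypothetical.

```python
def ctc_collapse(frames, blank="-"):
    """Collapse a frame-level CTC output: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for label in frames:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# An 8-frame network output collapses to a 3-phoneme sequence.
print(ctc_collapse(["-", "s", "s", "-", "ih", "ih", "t", "-"]))  # → ['s', 'ih', 't']
```

Note that a blank between two identical labels separates them, so repeated phonemes remain distinguishable from a single stretched one.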

    Table of Contents
    Chapter 1  Introduction
      1.1 Background and Motivation
      1.2 Main Contributions
      1.3 Organization of the Thesis
    Chapter 2  Literature Review
      2.1 Connectionist Temporal Classification (CTC)
        2.1.1 Basic Definitions
        2.1.2 Derivation of the Training Procedure
      2.2 Recurrent Neural Network (RNN)
      2.3 Convolutional Neural Network (CNN)
      2.4 Heterogeneous Multilayer Neural Networks (Convolutional Long Short-Term Memory Fully Connected Deep Neural Networks, CLDNN)
      2.5 Chapter Summary
    Chapter 3  Analysis of the CTC Learning Process
      3.1 CTC Recognition Architecture
      3.2 Baseline Experiments
      3.3 Phoneme Accuracy at Utterance Beginnings and Endings
      3.4 Substitution Error Analysis
      3.5 Chapter Summary
    Chapter 4  Convolutional Neural Networks Combined with CTC
      4.1 Introduction
      4.2 2D CNN and 3D CNN
      4.3 CLDNN and LCDNN Combined with CTC
      4.4 Chapter Summary
    Chapter 5  Conclusion
    References
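The motivation for the LCDNN layer ordering described in the abstract can be sketched numerically (a simplified illustration, not the thesis's implementation): non-overlapping max pooling in a CNN front end halves the frame rate seen by a subsequent LSTM, whereas placing the LSTM at the input lets it process every acoustic frame at the original rate.

```python
def max_pool_1d(frames, size=2, stride=2):
    """Non-overlapping 1-D max pooling over the time axis."""
    return [max(frames[i:i + size]) for i in range(0, len(frames) - size + 1, stride)]

T = 100  # hypothetical number of acoustic frames in one utterance
frames = list(range(T))

# CLDNN ordering: CNN (with max pooling) first, so the LSTM sees a downsampled sequence.
cldnn_lstm_input = max_pool_1d(frames)
# LCDNN ordering: LSTM first, so it sees every frame at the original rate.
lcdnn_lstm_input = frames

print(len(cldnn_lstm_input), len(lcdnn_lstm_input))  # → 50 100
```

The halved temporal resolution in the CLDNN case is the information loss the thesis argues makes it harder for the LSTM to extract discriminative features.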

    References
    [1] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
    [2] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, "Object Recognition with Gradient-Based Learning," in Shape, Contour and Grouping in Computer Vision, 1999.
    [3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
    [4] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385, Springer, 2012.
    [5] A. Mohamed, G. Hinton, and G. Penn, "Understanding How Deep Belief Networks Perform Acoustic Modelling," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012, pp. 4273–4276.
    [6] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
    [7] T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, and O. Vinyals, "Learning the Speech Front-end with Raw Waveform CLDNNs," in INTERSPEECH, 2015.
    [8] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end Speech Recognition Using Deep RNN Models and WFST-based Decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 167–174.
    [9] X. Shi, Z. Chen, H. Wang, et al., "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting," 2015, pp. 802–810.
    [10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
    [11] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, "Very Deep Multilingual Convolutional Neural Networks for LVCSR," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 4955–4959.
    [12] T. Sercu and V. Goel, "Dense Prediction on Sequences with Time-dilated Convolutions for Speech Recognition," arXiv:1611.09288, 2016.
    [13] C.-H. Lai and Y.-R. Wang, "A Study on Mandarin Speech Recognition using Long Short-Term Memory Neural Network," Computational Linguistics and Chinese Language Processing, vol. 23, no. 2, pp. 1–18, Dec. 2018.
    [14] G. Kurata and K. Audhkhasi, "Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation," in INTERSPEECH, 2019.
    [15] J. Heymann, K. C. Sim, and B. Li, "Improving CTC Using Stimulated Learning for Sequence Modeling," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5701–5705.
    [16] D. He, X. Yang, B. P. Lim, Y. Liang, M. Hasegawa-Johnson, and D. Chen, "When CTC Training Meets Acoustic Landmarks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 5996–6000.
    [17] Y. Shi, M. Hwang, and X. Lei, "End-to-end Speech Recognition Using a High Rank LSTM-CTC Based Model," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 7080–7084.
    [18] T. Sercu and N. Mallinar, "Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition," arXiv:1907.13121, 2019.
    [19] Y. Feng, Y. Zhang, and X. Xu, "End-to-end Speech Recognition System Based on Improved CLDNN Structure," in 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 2019, pp. 538–542.
    [20] 黃思齊, "A Study on English Phoneme Recognition Combining Connectionist Temporal Classification and Gaussian Mixture Models (結合連結時序分類與高斯混合模型之英語音素辨識研究)," Master's thesis, Department of Information Management, National Taiwan University of Science and Technology, 2019.
