
Graduate Student: 張天佑 (Tian-You Zhang)
Thesis Title: 基於 Transformer 的旋轉遮掩矩陣合成器與線性注意力機制構造的端到端語音辨識系統 (Transformer-based end to end speech recognition with rotary matrix mask synthesizer and linear attention)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 陳柏琳 (Berlin Chen), 曾厚強 (Hou-Chiang Tseng), 洪志偉 (Jeih-Weih Hung)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: Chinese
Number of Pages: 72
Chinese Keywords: 合成器注意力機制、旋轉遮掩矩陣、線性注意力機制、後置式旋轉位置編碼
Keywords: Synthesizer, rotary matrix mask, linear attention, rear rotary position encoding
    With the development of attention mechanisms, Transformer and Conformer have gradually come to dominate many fields, and their hybrid models with CTC have also performed remarkably well in end-to-end speech recognition systems. This thesis focuses on Transformer-based models, rethinks the role the self-attention mechanism plays in the Transformer, and attempts to build models with non-self-attention mechanisms.
    We take the structurally simple Synthesizer attention mechanism as our starting point. In terms of performance, its simple architecture makes it noticeably weaker than the self-attention mechanism. Inspired by position encodings, however, we construct a special mask for the Synthesizer; we call the resulting model Synthesizer Dense Attention with a rotary mask matrix. Experiments show that our method improves over Dense Attention with absolute position encoding by about 4.8% and achieves recognition performance nearly identical to that of the self-attention mechanism with absolute position encoding. This result encouraged us to keep improving the model in an attempt to surpass the baseline.
    During this process we also found that the benefits of convolution may conflict with mask-style methods, so the "pairwise" interaction present in the self-attention mechanism remains the key to better performance. We examine three linear attention mechanisms with very different designs. On two public corpora, combining our Synthesizer Dense Attention with a rotary mask matrix and a linear attention mechanism with rear rotary position encoding surpasses the baseline, yielding improvements of 2.3% and 2.1%, respectively. The experiments show that non-traditional attention mechanisms with different structures capture information with different sensitivities and prefer different acoustic features during extraction, so their characteristics can indeed be exploited to complement one another.


    With the development of attention mechanisms, Transformer and Conformer have gradually come to occupy the headlines in various fields, and they also shine in end-to-end speech recognition systems through hybrid models with CTC. We use a Transformer-based model for our research, rethink the role of the self-attention mechanism in the Transformer, and try to construct a model with a non-self-attention mechanism.
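    For readers unfamiliar with the hybrid model mentioned above, such systems are typically trained with a weighted combination of the CTC loss and the attention decoder's loss (after Watanabe et al.). The sketch below is only a minimal illustration of that objective, assuming PyTorch-style scalar losses; the weight 0.3 is a commonly used value, not necessarily the setting used in this thesis.

def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    # Interpolate the CTC loss (monotonic alignment) with the attention
    # decoder's cross-entropy loss; ctc_weight is a tunable factor and
    # 0.3 is only a commonly used default, not the thesis's stated value.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss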
    We take the simple-structured Synthesizer attention mechanism as the starting point of our research. We construct a special mask for the Synthesizer and call the resulting model Synthesizer Dense Attention with a rotary mask matrix. Experiments show that our method outperforms Dense Attention with absolute position encoding by about 4.8% and performs nearly as well as the self-attention mechanism with absolute position encoding. We then continue to improve the model's performance so that it exceeds the baseline.
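    As context for the paragraph above, the sketch below illustrates the basic Dense Synthesizer mechanism (after Tay et al.), in which attention weights are synthesized from each position alone rather than from query-key dot products. It is a minimal PyTorch-style sketch; the rotary_mask argument is only a hypothetical hook for the thesis's rotary mask matrix, whose exact construction is not given in this record.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    """Dense Synthesizer attention: weights come from a per-position MLP,
    not from query-key ("pairwise") dot products."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),   # one synthesized score per key position
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x, rotary_mask=None):
        # x: (batch, length, d_model), with length <= max_len
        length = x.size(1)
        scores = self.proj(x)[:, :, :length]             # (batch, length, length)
        if rotary_mask is not None:
            # Hypothetical hook: multiply a position-dependent mask into the scores.
            scores = scores * rotary_mask[:length, :length]
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, self.value(x))          # (batch, length, d_model)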
    The "pairwise" existing in the self-attention mechanism is still the key to improve performance and we choose three linear attention mechanisms to discuss. it is concluded that the combination of Dense Attention of our rotary matrix mask and Efficient Attention mechanism with rear rotary position encoding can exceed the baseline.We used two datasets for experiments, and obtained 2.3% and 2.1% improvement respectively compared to the baseline. In the process of extracting information, there is indeed the possibility of using their characteristics to achieve complementarity.

    Table of Contents
    Acknowledgements
    Abstract (Chinese)
    Abstract (English)
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
        1.1.1 Natural Language Processing
        1.1.2 Speech Recognition
    Chapter 2 Related Work
      2.1 Connectionist Temporal Classification
      2.2 General Self-Attention-Based Models
        2.2.1 Transformer
        2.2.2 Conformer
      2.3 Hybrid CTC/Transformer Architecture
      2.4 Synthesizer Dense Attention
      2.5 Linear Attention Mechanisms
        2.5.1 Efficient Attention
        2.5.2 Nyströmformer
        2.5.3 cosFormer
      2.6 Position Encoding
        2.6.1 Absolute Position Encoding
        2.6.2 Relative Position Encoding
          2.6.2.1 Relation-Aware Relative Position Encoding
          2.6.2.2 Transformer-XL Relative Position Encoding
          2.6.2.3 T5 Relative Position Encoding
        2.6.3 Rotary Position Encoding
      2.7 Local Dense Synthesizer Attention
      2.8 Conformer-Based Linear Attention
      2.9 Conformer-Based Linear Attention with Relative Position Encoding
        2.9.1 Locality-Biased Attention
        2.9.2 Linear Nyström Attention with Rotary Position Encoding
      2.10 Effect of Bias Values on Attention
      2.11 Speech Recognition Toolkits
    Chapter 3 Motivation and Proposed Methods
      3.1 Motivation
      3.2 Methods
        3.2.1 Synthesizer Attention with Convolutional Neural Networks
        3.2.2 Synthesizer Dense Attention with a Rotary Position Mask Matrix
        3.2.3 Hybrid Attention Mechanisms
        3.2.4 Linear Attention with Rear Rotary Position Encoding
    Chapter 4 Experiments
      4.1 Datasets
      4.2 Speech Recognition Configuration
      4.3 Results and Analysis
        4.3.1 Effect of Convolutional Neural Networks on Synthesizer Attention
        4.3.2 Results of Various Synthesizer Attention Mechanisms
        4.3.3 Analysis of Linear Attention Mechanisms
        4.3.4 Analysis of Hybrid Models
        4.3.5 Model Complexity Analysis
    Chapter 5 Conclusion and Future Work
    Chapter 6 References


    Full text available from 2025/08/30 (campus network)
    Full text available from 2025/08/30 (off-campus network)
    Full text available from 2025/08/30 (National Central Library: Taiwan NDLTD system)