
Graduate Student: 張天佑 (Tian-You Zhang)
Thesis Title: 基於 Transformer 的旋轉遮掩矩陣合成器與線性注意力機制構造的端到端語音辨識系統 (Transformer-based end to end speech recognition with rotary matrix mask synthesizer and linear attention)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 陳柏琳 (Berlin Chen), 曾厚強 (Hou-Chiang Tseng), 洪志偉 (Jeih-Weih Hung)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: Chinese
Number of Pages: 72
Chinese Keywords: 合成器注意力機制、旋轉遮掩矩陣、線性注意力機制、後置式旋轉位置編碼
Keywords: Synthesizer, rotary matrix mask, linear attention, rear rotary position encoding
    With the development of attention mechanisms, Transformer and Conformer have gradually come to dominate many fields, and their hybrid models with CTC have also performed remarkably well in end-to-end speech recognition systems. This thesis focuses on Transformer-based models, rethinks the role the self-attention mechanism plays in the Transformer, and attempts to build models with non-self-attention mechanisms.
    We take the structurally simple Synthesizer attention mechanism as our starting point. In terms of performance, its simple architecture makes it noticeably weaker than the self-attention mechanism. Inspired by position encodings, however, we construct a special mask for the Synthesizer; we call the resulting model Synthesizer Dense Attention with a rotary mask matrix. Experiments show that our method improves over Dense Attention with absolute position encoding by about 4.8% and achieves recognition performance nearly identical to that of the self-attention mechanism with absolute position encoding. This result encouraged us to keep improving the model in an attempt to surpass the baseline.
    During this process we also found that the benefits of convolution may conflict with mask-style methods, so the "pairwise" interaction present in the self-attention mechanism remains the key to better performance. We examine three linear attention mechanisms with very different designs. On two public corpora, combining our Synthesizer Dense Attention with a rotary mask matrix and a linear attention mechanism with rear rotary position encoding surpasses the baseline, yielding improvements of 2.3% and 2.1%, respectively. The experiments show that non-traditional attention mechanisms with different structures capture information with different sensitivities and prefer different acoustic features during extraction, so their characteristics can indeed be exploited to complement one another.


    With the development of attention mechanisms, Transformer and Conformer have gradually come to occupy the headlines in various fields, and they also shine in end-to-end speech recognition systems through hybrid models with CTC. We use a Transformer-based model for our research, rethink the role of the self-attention mechanism in the Transformer, and try to construct a model with a non-self-attention mechanism.
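    For readers unfamiliar with the hybrid model mentioned above, such systems are typically trained with a weighted combination of the CTC loss and the attention decoder's loss (after Watanabe et al.). The sketch below is only a minimal illustration of that objective, assuming PyTorch-style scalar losses; the weight 0.3 is a commonly used value, not necessarily the setting used in this thesis.

def hybrid_ctc_attention_loss(ctc_loss, att_loss, ctc_weight=0.3):
    # Interpolate the CTC loss (monotonic alignment) with the attention
    # decoder's cross-entropy loss; ctc_weight is a tunable factor and
    # 0.3 is only a commonly used default, not the thesis's stated value.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss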
    We take the simple-structured Synthesizer attention mechanism as the starting point of our research. We construct a special mask for the Synthesizer and call the resulting model Synthesizer Dense Attention with a rotary mask matrix. Experiments show that our method outperforms Dense Attention with absolute position encoding by about 4.8% and performs nearly as well as the self-attention mechanism with absolute position encoding. We then continue to improve the model's performance so that it exceeds the baseline.
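    As context for the paragraph above, the sketch below illustrates the basic Dense Synthesizer mechanism (after Tay et al.), in which attention weights are synthesized from each position alone rather than from query-key dot products. It is a minimal PyTorch-style sketch; the rotary_mask argument is only a hypothetical hook for the thesis's rotary mask matrix, whose exact construction is not given in this record.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    """Dense Synthesizer attention: weights come from a per-position MLP,
    not from query-key ("pairwise") dot products."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),   # one synthesized score per key position
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x, rotary_mask=None):
        # x: (batch, length, d_model), with length <= max_len
        length = x.size(1)
        scores = self.proj(x)[:, :, :length]             # (batch, length, length)
        if rotary_mask is not None:
            # Hypothetical hook: multiply a position-dependent mask into the scores.
            scores = scores * rotary_mask[:length, :length]
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, self.value(x))          # (batch, length, d_model)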
    The "pairwise" existing in the self-attention mechanism is still the key to improve performance and we choose three linear attention mechanisms to discuss. it is concluded that the combination of Dense Attention of our rotary matrix mask and Efficient Attention mechanism with rear rotary position encoding can exceed the baseline.We used two datasets for experiments, and obtained 2.3% and 2.1% improvement respectively compared to the baseline. In the process of extracting information, there is indeed the possibility of using their characteristics to achieve complementarity.

    Table of Contents
    Acknowledgements
    Abstract (Chinese)
    Abstract (English)
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
        1.1.1 Natural Language Processing
        1.1.2 Speech Recognition
    Chapter 2 Related Work
      2.1 Connectionist Temporal Classification
      2.2 General Self-Attention-Based Models
        2.2.1 Transformer
        2.2.2 Conformer
      2.3 Hybrid CTC/Transformer Architecture
      2.4 Synthesizer Dense Attention
      2.5 Linear Attention Mechanisms
        2.5.1 Efficient Attention
        2.5.2 Nyströmformer
        2.5.3 cosFormer
      2.6 Position Encoding
        2.6.1 Absolute Position Encoding
        2.6.2 Relative Position Encoding
          2.6.2.1 Relation-Aware Relative Position Encoding
          2.6.2.2 Transformer-XL Relative Position Encoding
          2.6.2.3 T5 Relative Position Encoding
        2.6.3 Rotary Position Encoding
      2.7 Local Dense Synthesizer Attention
      2.8 Conformer-Based Linear Attention
      2.9 Conformer-Based Linear Attention with Relative Position Encoding
        2.9.1 Locality-Biased Attention
        2.9.2 Linear Nyström Attention with Rotary Position Encoding
      2.10 Effect of Bias Values on Attention
      2.11 Speech Recognition Toolkits
    Chapter 3 Motivation and Proposed Methods
      3.1 Motivation
      3.2 Methods
        3.2.1 Synthesizer Attention with Convolutional Neural Networks
        3.2.2 Synthesizer Dense Attention with a Rotary Position Mask Matrix
        3.2.3 Hybrid Attention Mechanisms
        3.2.4 Linear Attention with Rear Rotary Position Encoding
    Chapter 4 Experiments
      4.1 Datasets
      4.2 Speech Recognition Configuration
      4.3 Results and Analysis
        4.3.1 Effect of Convolutional Neural Networks on Synthesizer Attention
        4.3.2 Results of Various Synthesizer Attention Mechanisms
        4.3.3 Analysis of Linear Attention Mechanisms
        4.3.4 Analysis of Hybrid Models
        4.3.5 Model Complexity Analysis
    Chapter 5 Conclusion and Future Work
    Chapter 6 References


    Full text available from 2025/08/30 (campus network)
    Full text available from 2025/08/30 (off-campus network)
    Full text available from 2025/08/30 (National Central Library: Taiwan NDLTD system)