
Student: 簡靖岳 (Chin-Yueh Chien)
Thesis title: 基於預訓練語言表示法之動態混合語言模型
A Dynamic Mixture Language Model Based on Pre-trained Language Representations
Advisor: 陳冠宇 (Kuan-Yu Chen)
Oral defense committee: 王新民, 林伯慎, 陳柏琳
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2020
Graduation academic year: 108
Language: Chinese
Number of pages: 83
Chinese keywords: language model, recurrent neural network, pre-trained language representations
Foreign-language keywords: BERT, Transformer, Language Model
This thesis surveys several well-known language models, from the traditional N-gram language model and the cache language model to neural approaches: the feed-forward neural network language model (NNLM), the recurrent neural network language model (RNNLM), and Transformer-based language models. Building on Bidirectional Encoder Representations from Transformers (BERT), it proposes a family of BERT-based language models: a BERT-embedding-based recurrent neural network language model, a BERT mixture language model, a global neural cache language model, and a local neural cache language model. In addition, the thesis proposes a dynamic weight adjustment method: a neural network automatically predicts a weight for each component language model according to the word history, yielding a robust dynamic mixture language model. On the plain-text datasets Penn Treebank (PTB) and Wikitext-2, the mixture model improves test-set perplexity (PPL) by 3.48% and 6.05% relative to traditional linear interpolation; on the speech recognition datasets Tedlium Release 2 and Wall Street Journal (WSJ), it improves test-set word error rate (WER) by 2.7% and 3.1% relative, respectively. Notably, on the Tedlium Release 2 test set the dynamic mixture language model reaches a WER of 6.51, the best result among known methods.
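To make the neural cache idea mentioned in the abstract concrete, the following is a minimal sketch in the spirit of continuous-cache language models: recently seen words whose stored hidden states resemble the current hidden state get boosted probability, and the cache distribution is then statically interpolated with a base model. This is an illustration only, not the thesis's actual formulation; all function and parameter names are hypothetical.

```python
import math

def cache_distribution(history_states, history_words, query_state,
                       vocab_size, theta=0.3):
    """Cache distribution over the vocabulary: a softmax over scaled
    dot products between the current state and stored past states,
    with probability mass assigned to the corresponding past words."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [theta * dot(query_state, h) for h in history_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    p = [0.0] * vocab_size
    for w, e in zip(history_words, exps):
        p[w] += e / z                          # repeated words accumulate mass
    return p

def interpolate(p_model, p_cache, lam=0.2):
    """Static linear interpolation of the base LM with the cache."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_model, p_cache)]
```

A larger `theta` sharpens the cache softmax, and `lam` controls how much the cache overrides the base model; both are fixed here, which is exactly the rigidity the thesis's dynamic weighting is meant to remove.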


This thesis introduces a variety of popular language models, including the classic N-gram language model, the cache language model, the neural probabilistic language model, the recurrent neural network language model, and Transformer-based language models. Based on the bidirectional encoder representations from Transformers (BERT), this study proposes a series of novel language models, including a BERT embedding-based recurrent neural network language model, a BERT mixture language model, a global neural cache language model, and a local neural cache language model. In addition, a dynamic mixture language model, which can dynamically determine a set of mixture weights for a set of language models by referring to the history, is also introduced. The proposed language model has been evaluated on the Penn Treebank (PTB) and Wikitext-2 corpora, where the relative improvements in perplexity (PPL) over the traditional linear interpolation method were up to 3.48% and 6.05%, respectively. Moreover, the dynamic mixture language model achieves 2.7% and 3.1% relative improvements in word error rate (WER) over the traditional linear interpolation method on the Tedlium Release 2 and Wall Street Journal (WSJ) datasets, respectively. It is worth noting that, to the best of our knowledge, the proposed dynamic mixture language model achieves state-of-the-art results on the Tedlium Release 2 dataset.
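The core of the dynamic mixture described above can be sketched as follows: a small network maps a history representation to softmax-normalized weights, one per component language model, and the component distributions are mixed with those weights. This is a minimal illustration assuming a single linear layer; the thesis's actual network and features may differ, and all names here are hypothetical.

```python
import math

def dynamic_mixture(history_vec, component_probs, W, b):
    """Predict history-dependent mixture weights with a linear layer
    followed by a softmax, then mix the component LM distributions.
    W is a (k x d) weight matrix, b a length-k bias, for k models."""
    k = len(component_probs)
    logits = [sum(wi * x for wi, x in zip(W[j], history_vec)) + b[j]
              for j in range(k)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]            # weights sum to 1
    vocab = len(component_probs[0])
    mixed = [sum(weights[j] * component_probs[j][v] for j in range(k))
             for v in range(vocab)]
    return weights, mixed
```

Because the weights are recomputed at every step from the history, the mixture can lean on different component models in different contexts, unlike linear interpolation, whose weights are fixed once on held-out data.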

    Chapter 1: Introduction · Chapter 2: Related Work · Chapter 3: A Dynamic Mixture Language Model Based on Pre-trained Language Representations · Chapter 4: Experimental Setup · Chapter 5: Experimental Results · Chapter 6: Conclusions and Future Work · Chapter 7: References


    Full text release date: 2025/08/26 (campus network)
    Full text release date: 2025/08/26 (off-campus network)
    Full text release date: 2025/08/26 (National Central Library: Taiwan Dissertations and Theses System)