
Graduate Student: Chia-Chih Kuo (郭家銍)
Thesis Title: A Multilingual BERT-based Zero-mean Regularization Framework for Multilingual Question Answering (一套基於多語言BERT的去均值正則化框架以用於多語言問答系統)
Advisor: Kuan-Yu Chen (陳冠宇)
Committee Members: Bor-Shen Lin (林伯慎), Berlin Chen (陳柏琳), Hsin-Min Wang (王新民)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109 (2020–2021)
Language: English
Number of Pages: 50
Chinese Keywords: 多語言問答系統 (multilingual question answering), 零樣本學習 (zero-shot learning), 多語言BERT (multilingual BERT)
English Keywords: multilingual, question answering, zero-resource, mBERT
    Abstract (Chinese, translated): In recent years, multilingual question answering has become an emerging research topic and has attracted wide attention. Thanks to various advanced deep learning-based techniques, systems developed for English and other resource-rich languages are highly mature, but for resource-poor languages most of these techniques are difficult to realize because of the scarcity of data. Therefore, many studies take the multilingual Bidirectional Encoder Representations from Transformers (BERT) as a foundation and transfer the knowledge learned from resource-rich languages to resource-poor ones, aiming to improve performance on the resource-poor languages in a zero-shot (or few-shot) manner. However, many recent studies still require large amounts of unlabeled data to carry out zero-shot learning for resource-poor languages. In view of this, we propose a Zero-mean Regularization (ZMR) data-augmentation framework, and, to further reduce the intrinsic distribution differences between the representations of different languages, we also propose an auxiliary training objective. To explore the potential of the ZMR framework, we propose four auxiliary methods and integrate them into a ZMR-Hybrid system. Compared with several baseline systems that require millions of unlabeled examples, our ZMR-Hybrid system achieves strong zero-shot performance in a zero-resource setting (i.e., without any unlabeled data), and the framework also benefits the performance of the languages used during training.


    Abstract (English): In recent years, multilingual question answering has become an emerging research topic and has attracted much attention. Systems for English and other rich-resource languages are highly developed and rely on various advanced deep learning-based techniques, but most of these techniques are impractical to apply to low-resource languages because of insufficient data. Therefore, many studies try to improve performance on low-resource languages in a zero-shot (or few-shot) manner by building on the multilingual Bidirectional Encoder Representations from Transformers (mBERT) and transferring knowledge learned from rich-resource languages to the low-resource languages. However, many recent studies still require a large amount of unlabeled data to perform zero-shot learning for low-resource languages. Accordingly, we propose a zero-mean regularization (ZMR) framework for data augmentation, together with an auxiliary objective that reduces the intrinsic distribution differences between the representations of different languages. To explore the potential of the ZMR framework, we propose four auxiliary methods and eventually combine them into a ZMR-Hybrid system. Compared with several baseline systems that require millions of unlabeled examples, our ZMR-Hybrid system not only achieves highly competitive zero-shot performance in a zero-resource setting (i.e., without any unlabeled data) but also improves performance on the languages used in training.
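
    The abstract describes two core ideas: zero-mean regularization, which removes language-specific information by subtracting a per-language mean vector from mBERT token representations, and an auxiliary KL-divergence objective (Section 4.3 in the table of contents below) that narrows the gap between representations of different languages. The following Python sketch only illustrates how these pieces might fit together and is not the thesis implementation: the way the language mean is estimated, the hypothetical span head, and the pairing of the original and zero-meaned views inside the KL term are assumptions made here for concreteness.

    # A minimal sketch, assuming a standard HuggingFace mBERT encoder and a
    # hypothetical linear span head; the thesis may estimate and apply the
    # language mean vector, and define the KL term, differently.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    MODEL_NAME = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    encoder = AutoModel.from_pretrained(MODEL_NAME)

    def language_mean(sentences):
        """Estimate a language-specific mean vector from mBERT token embeddings."""
        vecs = []
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = encoder(**enc).last_hidden_state[0]    # (seq_len, 768)
            vecs.append(hidden.mean(dim=0))
        return torch.stack(vecs).mean(dim=0)                    # (768,)

    # Hypothetical span head: maps each token vector to start/end logits.
    span_head = torch.nn.Linear(encoder.config.hidden_size, 2)

    def span_logits(hidden):
        start, end = span_head(hidden).split(1, dim=-1)
        return start.squeeze(-1), end.squeeze(-1)               # each (batch, seq_len)

    # Zero-mean regularization: subtract the language mean vector from every
    # token representation, producing an augmented "zero-meaned" view.
    en_mean = language_mean(["What is the capital of France?",
                             "Paris is the capital of France."])
    batch = tokenizer("What is the capital of France?",
                      "Paris is the capital of France.",
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state                 # (1, seq_len, 768)
    hidden_zm = hidden - en_mean                                 # zero-meaned view

    # Assumed auxiliary KL objective: encourage the answer-span distributions of
    # the original and zero-meaned views to agree. The abstract only states that
    # the term reduces distribution differences between languages; this
    # particular pairing is an illustrative guess.
    s, e = span_logits(hidden)
    s_zm, e_zm = span_logits(hidden_zm)
    kl_aux = (F.kl_div(F.log_softmax(s_zm, dim=-1), F.softmax(s, dim=-1),
                       reduction="batchmean")
              + F.kl_div(F.log_softmax(e_zm, dim=-1), F.softmax(e, dim=-1),
                         reduction="batchmean"))
    print(f"auxiliary KL term: {kl_aux.item():.4f}")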

    Table of Contents:
    1 Introduction
    2 Background
    3 Related Work
      3.1 The Language Representation Methods
      3.2 The Multilingual Pretraining of Language Representations
      3.3 The Downstream Fine-tuning of Language Representations
    4 Methodology
      4.1 The Vanilla mBERT Method
      4.2 Zero-mean Regularization (ZMR)
      4.3 Auxiliary KL Divergence Objective
      4.4 L2 Penalty for Length Regularization
      4.5 Concatenation of Hidden States
      4.6 Zero-meaned Token Detection (ZMTD)
      4.7 ZMR-Hybrid System
    5 Experimental Settings
      5.1 Dataset
      5.2 Evaluation Metrics
      5.3 Implementation Details
    6 Experimental Results
      6.1 Zero-shot/Zero-resource Performance
      6.2 Impact on the Training Languages
      6.3 L2 Penalty for Length Regularization
      6.4 Concatenation of Hidden States
      6.5 Zero-meaned Token Detection (ZMTD)
      6.6 ZMR-Hybrid System
    7 Analysis and Discussion
      7.1 Selection of the Baseline Model
      7.2 Random Vectors as Mean Vectors
      7.3 Principal Component Analysis (PCA) and Lengths of Token Embeddings
    8 Conclusions and Future Work
    References
    Appendix

