
Graduate Student: Tzu-Chieh Hsu (徐子杰)
Thesis Title: Research on Chinese Word Segmentation Models based on Stacking and Transformer (基於Stacking與Transformer的中文斷詞模型之研究)
Advisor: Kuan-Yu Chen (陳冠宇)
Committee Members: Ming-Hsiang Su (蘇明祥), Hou-Chiang Tseng (曾厚強)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: Chinese
Pages: 77
Chinese Keywords: 中文斷詞系統、集成模型、變形模型
English Keywords: Chinese Word Segmentation, Ensemble model, Transformer model
Hits: 284, Downloads: 5

With the rise of the self-attention mechanism, models built on the Transformer architecture have become the mainstream. Transformer-based models such as BERT and RoBERTa, once pre-trained on large amounts of data with suitable pre-training tasks, become powerful pre-trained models that need only modest fine-tuning to achieve strong performance on downstream tasks. This thesis applies five Transformer-based pre-trained models to the Chinese word segmentation task and combines them pairwise using four different combination strategies. It examines whether each of the five base models performs well on Chinese word segmentation, and whether combining two models can merge their respective strengths and further improve performance. The experimental results are also analyzed, including the error distribution of each model; the error analysis reveals that the PKU dataset itself contains inconsistent annotations, which prevents the scores on this dataset from improving.
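The thesis details its stacking combinations in Chapter 3; as a rough illustration of the general idea only (not the thesis's actual implementation), the sketch below treats Chinese word segmentation as B/M/E/S character tagging and stacks two base taggers with a logistic-regression meta-classifier. The base models here are random stand-ins for fine-tuned Transformer encoders, and the example sentence, gold tags, and function names are all hypothetical.

    # Minimal stacking sketch for Chinese word segmentation as character tagging.
    # NOT the thesis's implementation: the base "models" are random stand-ins for
    # fine-tuned Transformer encoders, and the meta-learner is plain logistic regression.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    TAGS = ["B", "M", "E", "S"]  # Begin / Middle / End of a word, or Single-character word

    def fake_base_model(sentence, seed):
        """Stand-in for a fine-tuned encoder: returns a (len(sentence), 4) matrix of
        per-character tag probabilities. A real system would run BERT/RoBERTa/ZEN here."""
        rng = np.random.default_rng(seed + len(sentence))
        logits = rng.normal(size=(len(sentence), len(TAGS)))
        return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    def stack_features(sentence):
        """Level-1 features for the meta-learner: concatenated outputs of two base models."""
        return np.hstack([fake_base_model(sentence, seed=1),
                          fake_base_model(sentence, seed=2)])  # shape (len, 8)

    # Toy training data: one sentence with hypothetical gold B/M/E/S tags.
    train_sent = "我喜歡自然語言處理"
    train_tags = list("SBEBEBEBE")
    meta = LogisticRegression(max_iter=1000).fit(stack_features(train_sent), train_tags)

    def segment(sentence):
        """Greedily decode predicted B/M/E/S tags into words (no CRF decoding here)."""
        tags = meta.predict(stack_features(sentence))
        words, cur = [], ""
        for ch, tag in zip(sentence, tags):
            cur += ch
            if tag in ("E", "S"):
                words.append(cur)
                cur = ""
        if cur:
            words.append(cur)
        return words

    print(segment("我喜歡自然語言處理"))

In a real stacking setup the level-1 features would come from held-out (out-of-fold) predictions of the fine-tuned base models rather than from the training data directly, so the meta-learner does not simply inherit the base models' overfitting.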

Chapter 1 Introduction
 1.1 Research Background
  1.1.1 Natural Language Processing
  1.1.2 Chinese Word Segmentation
Chapter 2 Related Work
 2.1 Chinese Word Segmentation as a Sequence Labeling Task
 2.2 Conditional Random Fields
 2.3 BERT
 2.4 ZEN
 2.5 RoBERTa
 2.6 Whole Word Masking
 2.7 Chinese Word Segmentation with Wordhood Memory Network
 2.8 Ensemble Learning
Chapter 3 Research on Chinese Word Segmentation Models based on Stacking and Transformer
 3.1 Training Procedure
 3.2 Datasets
 3.3 Evaluation Metric
 3.4 Experimental Settings
Chapter 4 Experimental Results
 4.1 Base Models
 4.2 Combined Models
 4.3 Error Analysis
 4.4 Error Discussion
 4.5 Dataset Discussion
 4.6 Impact of Data Errors on Model Performance
 4.7 Comparison of the Combined Models with State-of-the-Art Models
Chapter 5 Conclusions and Future Work
Chapter 6 References

[1] J. D. M.-W. C. Kenton and L. K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Proceedings of NAACL-HLT, pp. 4171-4186, 2019.
[2] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6645-6649, 2013.
[3] B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition," Technometrics, vol. 33, no. 3, pp. 251-272, 1991.
[4] N. Xue, "Chinese word segmentation as character tagging," International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing, pp. 29-48, 2003.
[5] J. Ma, K. Ganchev, and D. Weiss, "State-of-the-art Chinese Word Segmentation with Bi-LSTMs," Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4902-4908, 2018.
[6] C. Wang and B. Xu, "Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation," Proceedings of the Eighth International Joint Conference on Natural Language Processing vol. 1, pp. 163-172, 2017.
[7] X. Qiu, H. Pei, H. Yan, and X.-J. Huang, "A Concise Model for Multi-Criteria Chinese Word Segmentation with Transformer Encoder," Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2887-2897, 2020.
[8] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," Conference on empirical methods in natural language processing, 1996.
[9] P. Wang, Y. Qian, F. K. Soong, L. He, and H. Zhao, "Part-of-speech tagging with bidirectional long short-term memory recurrent neural network," arXiv preprint arXiv:1510.06168, 2015.
[10] Y. Tian et al., "Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8286-8296, 2020.
[11] R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval. ACM press New York, 1999.
[12] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press Cambridge, 2008.
[13] M. Maybury, Advances in automatic text summarization. MIT press, 1999.
[14] G. Salton, A. Singhal, M. Mitra, and C. Buckley, "Automatic text structuring and summarization," Information Processing & Management, vol. 33, no. 2, pp. 193-207, 1997.
[15] P. F. Brown et al., "A statistical approach to machine translation," Computational linguistics, vol. 16, no. 2, pp. 79-85, 1990.
[16] P. Koehn, Statistical machine translation. Cambridge University Press, 2009.
[17] L. Hirschman and R. Gaizauskas, "Natural language question answering: the view from here," Natural Language Engineering, vol. 7, no. 4, pp. 275-300, 2001.
[18] D. Weissenborn, G. Wiese, and L. Seiffe, "Making Neural QA as Simple as Possible but not Simpler," Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 271-280, 2017.
[19] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," ICML, vol. 18, pp. 282–289, 2001.
[20] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[21] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes."
[22] G. D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
[23] A. McCallum, D. Freitag, and F. C. Pereira, "Maximum entropy Markov models for information extraction and segmentation," ICML, vol. 17, pp. 591-598, 2000.
[24] A. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
[25] J. N. Darroch and D. Ratcliff, "Generalized iterative scaling for log-linear models," The Annals of Mathematical Statistics, pp. 1470-1480, 1972.
[26] F. Peng, F. Feng, and A. McCallum, "Chinese segmentation and new word detection using conditional random fields," COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pp. 562-568, 2004.
[27] P. Clifford and J. Hammersley, "Markov fields on finite graphs and lattices," 1971.
[28] A. Berger, "The improved iterative scaling algorithm: A gentle introduction," Technical report, Carnegie Mellon University, 1997.
[29] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems vol. 30, 2017.
[30] D. Britz, A. Goldie, M.-T. Luong, and Q. Le, "Massive Exploration of Neural Machine Translation Architectures," Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1442-1451, 2017.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
[32] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[33] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353-355, 2018.
[34] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ Questions for Machine Comprehension of Text," Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383-2392, 2016.
[35] P. Rajpurkar, R. Jia, and P. Liang, "Know What You Don’t Know: Unanswerable Questions for SQuAD," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784-789, 2018.
[36] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi, "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference," Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 93–104, 2018.
[37] Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang, "Pre-training with whole word masking for Chinese BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3504-3514, 2021.
[38] S. Diao, J. Bai, Y. Song, T. Zhang, and Y. Wang, "ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations," Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4729-4740, 2020.
[39] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[40] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling Neural Machine Translation," Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1-9, 2018.
[41] Y. You, J. Li, J. Hseu, X. Song, J. Demmel, and C.-J. Hsieh, "Reducing BERT pre-training time from 3 days to 76 minutes," arXiv preprint arXiv:1904.00962, 2019.
[42] Y. Zhu et al., "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," Proceedings of the IEEE international conference on computer vision, pp. 19-27, 2015.
[43] S. Nagel, "CC-News," 2016.
[44] A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex, "OpenWebText corpus," 2019.
[45] T. H. Trinh and Q. V. Le, "A simple method for commonsense reasoning," arXiv preprint, 2018.
[46] G. Lample and A. Conneau, "Cross-lingual language model pretraining," arXiv preprint, 2019.
[47] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, 2019.
[48] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, "SpanBERT: Improving Pre-training by Representing and Predicting Spans," Transactions of the Association for Computational Linguistics, vol. 8, pp. 64-77, 2020.
[49] R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715-1725, 2016.
[50] W. Che, Z. Li, and T. Liu, "LTP: A Chinese language technology platform," Coling 2010: Demonstrations, pp. 13-16, 2010.
[51] X. Chen, X. Qiu, C. Zhu, P. Liu, and X.-J. Huang, "Long short-term memory neural networks for Chinese word segmentation," Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1197-1206, 2015.
[52] S. Higashiyama et al., "Incorporating word attention into character-based word segmentation," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2699-2709, 2019.
[53] W. Pei, T. Ge, and B. Chang, "Max-margin tensor neural network for Chinese word segmentation," Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 293-303, 2014.
[54] H. Zhou, Z. Yu, Y. Zhang, S. Huang, X. Dai, and J. Chen, "Word-context character embeddings for Chinese word segmentation," Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 760-766, 2017.
[55] Y. Tian, Y. Song, F. Xia, T. Zhang, and Y. Wang, "Improving Chinese word segmentation with wordhood memory networks," Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 8274-8285, 2020.
[56] A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, "Key-Value Memory Networks for Directly Reading Documents," Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1400-1409, 2016.
[57] H. Feng, K. Chen, X. Deng, and W. Zheng, "Accessor variety criteria for Chinese word extraction," Computational Linguistics, vol. 30, no. 1, pp. 75-93, 2004.
[58] M. Sun, D. Shen, and B. K. Tsou, "Chinese word segmentation without using lexicon and hand-crafted training data," 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 2, pp. 1265-1271, 1998.
[59] C. Kit and Y. Wilks, "Unsupervised learning of word boundary with description length gain," EACL 1999: CoNLL-99 Computational Natural Language Learning, 1999.
[60] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, vol. 33, no. 1, pp. 1-39, 2010.
[61] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[62] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," ICML, vol. 96, pp. 148-156, 1996.
[63] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241-259, 1992.
[64] I. Syarif, E. Zaluska, A. Prugel-Bennett, and G. Wills, "Application of bagging, boosting and stacking to intrusion detection," International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 593-602, 2012.
[65] R. Odegua, "An empirical study of ensemble techniques (bagging boosting and stacking)," Proc. Conf.: Deep Learn. IndabaXAt, 2019.
[66] T. Emerson, "The second international Chinese word segmentation bakeoff," Proceedings of the fourth SIGHAN workshop on Chinese language Processing, 2005.
[67] Y. Sasaki, "The truth of the F-measure," Teach Tutor Mater, vol. 1, no. 5, pp. 1-5, 2007.
[68] P. Jiang, D. Long, Y. Zhang, P. Xie, M. Zhang, and M. Zhang, "Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling," arXiv preprint arXiv:2210.15231, 2022.
[69] Y. Meng et al., "Glyce: Glyph-vectors for Chinese character representations," Advances in Neural Information Processing Systems, vol. 32, 2019.
