| Graduate Student | 邱諭芳 (Yu-Fang Chiu) |
|---|---|
| Thesis Title | 詞內嵌空間中複合詞的探索 (Exploring the Compound Words in the Embedding Space) |
| Advisor | 林伯慎 (Bor-Shen Lin) |
| Committee Members | 楊傳凱 (Chuan-Kai Yang), 羅乃維 (Nai-Wei Lo) |
| Degree | Master |
| Department | Department of Information Management, School of Management |
| Year of Publication | 2021 |
| Graduation Academic Year | 109 |
| Language | Chinese |
| Number of Pages | 56 |
| Chinese Keywords | word embedding model, compound word, semantic space density, neighbor entropy, semantic shift, language teaching |
| English Keywords | compound word, neighbor entropy, semantic shifting |
Word embedding models project words into a vector space through machine-learning algorithms; the best-known example is the Skip-gram model. Because the embedding space preserves the semantic discriminability of words, it performs well in recommendation systems, product retrieval, and near-synonym search. Although past studies have verified the effectiveness of this model, the characteristics of its semantic space have not been analyzed or explained in depth. In addition, compound words are important representational units for the concepts of a subject domain; for example, artificial intelligence and super bowl are compound words whose meanings and usages are not covered by the semantics of their constituent words. A traditional embedding space based on single words is therefore insufficient to describe the semantic characteristics of compound words. Based on these ideas, this study investigates compound-word filtering and the training of a compound-word embedding space, and proposes three metrics for exploring the characteristics of compound words in the semantic space. We first compute the area of the ellipse around a word's lexical neighbors to measure spatial density; the analysis shows that the lower a compound word's density, the broader and more ambiguous its meaning, while the higher the density, the more specific and restricted the meaning. Next, we define the neighbor entropy of a word: the larger its value, the more complex the word's contexts; the smaller its value, the more the word belongs to a specific domain. Finally, we define the semantic shift between a compound word and its constituent words; the larger the shift, the further the compound's meaning has deviated from the semantics of the original words, making it hard for language learners to infer the meaning on their own. These three metrics can be used to filter compound words with different properties, such as more general or more specific, and harder or easier to understand, with potential applications in language teaching, professional lexicography, and semantic search.
Word embedding models such as Skip-gram map words into a vector space. Since the embedding space retains the semantic discrimination of words, it achieves good performance in applications such as recommendation systems. Although past studies have verified the effectiveness of such models, there is not yet an effective metric or tool for analyzing the characteristics of words in the embedding space. On the other hand, compound words, such as super bowl, are important concepts in a subject area, but their meanings and usages might be quite different from those of their constituent words. A traditional word embedding model is hence not sufficient for describing the semantics of compound words. Accordingly, this research explores the training and analysis of an embedding model with compound words. Three metrics are proposed to analyze the characteristics of compound words. The first is the area of the ellipse computed from the neighbors of a compound word, which measures its spatial density; the smaller the density, the more extensive and ambiguous the meaning, and vice versa. The second is the neighbor entropy, which denotes the context complexity of a word; the smaller the entropy, the more domain-specific the word. The third is the semantic shift between a compound word and its constituent words; the larger the semantic shift, the further the meaning of the compound word deviates from the semantics of its constituents, and the more difficult it is for learners to infer the meaning of that compound word. These three metrics can be used to sort and filter compound words according to desired properties, such as more general or more specific, or more difficult or easier to understand. They are potentially applicable to language teaching, professional dictionary editing, and semantic search.
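The abstract names three metrics but not their exact formulas, so the sketch below uses plausible stand-in definitions: the ellipse area is taken from the two principal axes of a word's nearest-neighbor cloud, the neighbor entropy from softmax-normalized neighbor similarities, and the semantic shift as the cosine distance between a compound's vector and the mean of its constituents' vectors. In practice the embedding matrix `E` would come from a trained model (e.g. gensim word2vec over a corpus with compound words merged into single tokens); random vectors stand in here so the sketch runs on its own.

```python
import numpy as np

def nearest_neighbors(E, idx, k):
    """Indices and cosine similarities of the k nearest neighbors of word idx."""
    v = E[idx]
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-12)
    sims[idx] = -np.inf                      # exclude the word itself
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def ellipse_area(E, idx, k=10):
    """Spatial-density proxy: pi*a*b, where a and b are the standard
    deviations of the neighbor cloud along its two principal axes."""
    nbrs, _ = nearest_neighbors(E, idx, k)
    X = E[nbrs] - E[nbrs].mean(axis=0)       # center the neighbor cloud
    s = np.linalg.svd(X, compute_uv=False)[:2] / np.sqrt(k)
    return float(np.pi * s[0] * s[1])        # smaller area = denser region

def neighbor_entropy(E, idx, k=10):
    """Entropy of the softmax-normalized similarities to the k nearest
    neighbors; larger values suggest more diverse (complex) contexts."""
    _, sims = nearest_neighbors(E, idx, k)
    p = np.exp(sims) / np.exp(sims).sum()
    return float(-(p * np.log(p)).sum())

def semantic_shift(E, compound_idx, part_idxs):
    """Cosine distance between a compound's vector and the mean of its
    constituent words' vectors; larger = meaning drifted further."""
    c = E[compound_idx]
    m = E[part_idxs].mean(axis=0)
    cos = c @ m / (np.linalg.norm(c) * np.linalg.norm(m) + 1e-12)
    return 1.0 - float(cos)

# Toy demo: random vectors standing in for a trained embedding matrix.
rng = np.random.default_rng(0)
E = rng.normal(size=(100, 50))
print(ellipse_area(E, 0), neighbor_entropy(E, 0), semantic_shift(E, 0, [1, 2]))
```

With a real model, sorting the vocabulary by these scores would surface, for instance, low-entropy compounds as candidates for a domain glossary, or high-shift compounds as the ones a language learner cannot guess from the parts.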