
Graduate Student: 邱諭芳 (Yu-Fang Chiu)
Thesis Title: 詞內嵌空間中複合詞的探索 (Exploring the Compound Words in the Embedding Space)
Advisor: 林伯慎 (Bor-Shen Lin)
Oral Defense Committee: 楊傳凱 (Chuan-Kai Yang), 羅乃維 (Nai-Wei Lo)
Degree: Master
Department: School of Management - Department of Information Management
Year of Publication: 2021
Academic Year of Graduation: 109 (2020-2021)
Language: Chinese
Number of Pages: 56
Chinese Keywords: word embedding model, compound word, semantic space density, neighbor entropy, semantic shift, language teaching
English Keywords: compound word, neighbor entropy, semantic shifting
Hits: 154; Downloads: 1
  • Word embedding models project words into a vector space through machine-learning algorithms, the best-known example being the Skip-gram model. Because the embedding space preserves the semantic discriminability of words, it performs well in recommendation systems, product retrieval, and semantic similarity search. Although past studies have verified the effectiveness of this model, the characteristics of its semantic space have not been analyzed or explained in depth. Moreover, compound words are important units for representing concepts in a subject domain; "artificial intelligence" and "super bowl", for example, are compound words whose meanings and usages are not covered by the semantics of their constituent words. A traditional word embedding space built on single words is therefore insufficient to describe the semantic characteristics of compound words. Based on these observations, this study investigates compound-word filtering and the training of a compound-word embedding space, and proposes three metrics for examining how compound words behave in the semantic space. We first compute the area of the ellipse spanned by a word's nearest neighbors to measure spatial density; the analysis shows that the lower a compound word's density, the broader and more ambiguous its meaning, while higher density indicates a more definite and restricted meaning. Next, we define a word's neighbor entropy: the larger its value, the more complex the word's contexts; the smaller its value, the more the word belongs to a specific domain. Finally, we define the semantic shift between a compound word and its constituent words; the larger the shift, the further the compound's meaning has deviated from that of the original words, making it hard for language learners to infer the meaning on their own. These three metrics can be used to filter compound words with different properties, such as more general or more specific, and harder or easier to understand, and are potentially applicable to language teaching, specialized lexicography, and semantic search.
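    The thesis itself is not reproduced here, but the pipeline the abstract describes can be sketched roughly as follows, assuming the gensim library. The corpus path, thresholds, and hyperparameters are illustrative assumptions, and gensim's Phrases collocation detector merely stands in for whatever mutual-information-based filtering the thesis actually used.

```python
# Minimal sketch: compound-word detection plus Skip-gram training (gensim assumed).
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Assumed input: a tokenized corpus, one sentence per line ("corpus.txt" is hypothetical).
sentences = [line.lower().split() for line in open("corpus.txt", encoding="utf-8")]

# Join statistically associated bigrams such as "super bowl" into "super_bowl";
# min_count acts as an occurrence-count threshold, threshold as an association cutoff.
phrases = Phraser(Phrases(sentences, min_count=20, threshold=10.0))
compound_sentences = [phrases[s] for s in sentences]

# Train a Skip-gram embedding over the compound-augmented corpus.
model = Word2Vec(
    compound_sentences,
    vector_size=100,  # embedding dimension (assumed)
    window=5,
    min_count=20,     # vocabulary occurrence threshold (assumed)
    sg=1,             # 1 = Skip-gram
)
model.save("compound_skipgram.model")
```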


    Word embedding models such as Skip-gram map words into a vector space. Since the embedding space retains the semantic discrimination of words, it achieves good performance in applications such as recommendation systems. Although past research has verified the effectiveness of such models, there are not yet effective metrics or tools for analyzing the characteristics of words in the embedding space. On the other hand, compound words, such as "super bowl", are important concepts in a subject area, but their meanings or usages might be quite different from those of their constituent words. Traditional word embedding models are hence insufficient for describing the semantics of compound words. Accordingly, this research explores the training and analysis of an embedding model with compound words, and proposes three metrics for analyzing their characteristics. The first is the area of the ellipse computed from the neighbors of a compound word, which measures its spatial density; the lower the density, the broader and more ambiguous the meaning, and vice versa. The second is the neighbor entropy, which denotes the context complexity of a word; the smaller the entropy, the more domain-specific the word. The third is the semantic shift between a compound word and its constituent words; the larger the shift, the further the compound's meaning deviates from the semantics of its constituents, and the more difficult it is for learners to infer its meaning. These three metrics can be used to sort and filter compound words according to desired properties, such as more general or more specific, and more difficult or easier to understand. They are potentially applicable to language teaching, professional dictionary editing, and semantic search.
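    The three metrics could plausibly be computed as below, again assuming a trained gensim model. The abstract does not give exact formulas, so the neighborhood size k, the use of the two largest covariance eigenvalues for the ellipse, the similarity-weighted entropy, and composing constituents by vector addition are all assumptions, not the thesis's definitions (the table of contents does confirm the shift is measured as an angle).

```python
# Hedged sketch of the three metrics on a trained embedding (definitions assumed).
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("compound_skipgram.model")
wv = model.wv

def neighbor_ellipse_area(word, k=20):
    """Spatial density: area of the 1-sigma ellipse of the k nearest neighbors,
    taken along their two principal axes (assumed form)."""
    neighbors = np.array([wv[w] for w, _ in wv.most_similar(word, topn=k)])
    X = neighbors - neighbors.mean(axis=0)
    top2 = np.linalg.eigvalsh(np.cov(X, rowvar=False))[-2:]  # two largest eigenvalues
    return float(np.pi * np.sqrt(top2[0] * top2[1]))

def neighbor_entropy(word, k=20):
    """Context complexity: entropy of the normalized similarity distribution
    over the k nearest neighbors (one plausible reading of the definition)."""
    sims = np.array([s for _, s in wv.most_similar(word, topn=k)])
    p = np.clip(sims, 1e-12, None)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p)))

def semantic_shift_angle(compound):
    """Angle between a compound word and the sum of its constituents,
    e.g. 'super_bowl' vs. 'super' + 'bowl' (additive composition assumed)."""
    composed = np.sum([wv[p] for p in compound.split("_")], axis=0)
    cos = np.dot(wv[compound], composed) / (
        np.linalg.norm(wv[compound]) * np.linalg.norm(composed))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical usage, assuming these compounds survived filtering into the vocabulary.
for w in ["artificial_intelligence", "super_bowl"]:
    print(w, neighbor_ellipse_area(w), neighbor_entropy(w), semantic_shift_angle(w))
```

    Under this reading, a small ellipse area and low entropy would flag domain-specific terms, while a large shift angle would flag compounds whose meaning learners cannot infer from the constituent words.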

    Table of Contents
    Chapter 1  Introduction
      1.1  Research Background
      1.2  Research Motivation and Goals
      1.3  Thesis Organization
    Chapter 2  Literature Review
      2.1  Related Work on Semantic Analysis
      2.2  Related Work on Neural Networks and Language Models
      2.3  Mutual Information (MI)
      2.4  Word2Vec
      2.5  Compound Words
      2.6  Semantic Shift Metrics
      2.7  Chapter Summary
    Chapter 3  Analysis of Compound-Word Semantic Vectors
      3.1  Data Preprocessing
      3.2  Embedding Model Training
      3.3  Data Analysis
        3.3.1  Semantic Space Distribution of Compound Words
        3.3.2  Neighbor Entropy
        3.3.3  Semantic Shift Angle
        3.3.4  Chapter Summary
    Chapter 4  Filtering and Analysis of Compound Words
      4.1  Effect of the Occurrence-Count Threshold on Compound-Word Selection
      4.2  Fine-Tuning Training
      4.3  Ranking and Filtering of Compound Words
      4.4  Analysis of the Semantic Shift Characteristics of Compound Words
    Chapter 5  Conclusion
    References

