
Graduate student: 紀至鍇
Zhi-kai JI
Thesis title: 應用語境樣式分群於文件檢索中的查詢精煉之研究
Context Pattern Clustering Applied to Query Refinement for Text Retrieval
Advisor: 林伯慎
Bor-Shen Lin
Committee members: 楊傳凱
Chuan-Kai Yang
古鴻炎
Hung-yan Gu
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2013
Academic year of graduation: 102 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords (Chinese): 語境樣式, 語境分群, 詞義分析, 主題分析, 查詢擴展, 查詢精煉, 文件檢索
Keywords (English): context pattern, context clustering, word senses analysis, topic analysis, query expansion, query refinement, document retrieval
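  • The main goal of this study is to use context pattern clustering to find context clusters that capture the main senses of a query term, and then to identify, within each cluster, the context words that are most discriminative for that query term. These words can be used to refine the query in order to increase the precision of document retrieval, or to assist word sense analysis of the query term. We first use a document collection related to "蘋果" (apple) as the baseline retrieval experiment; using the query term alone, the purity of the top 20 documents is 0.85. The context pattern clustering experiments show that GMM clustering achieves a higher F-measure than k-means and agglomerative clustering. In addition, the discriminative words obtained from the unrefined and the refined clusters reach retrieval purities of 0.96 and 0.973, respectively, when the context pattern classes are balanced in size, and 0.93 and 0.973, respectively, when they are unbalanced. This indicates that cluster refinement helps to find discriminative context words. Among the criteria for ranking discriminative words, the product of term frequency and a modified inverse document frequency gives the best retrieval performance. Finally, we apply context clustering and cluster sense analysis to the terms "火箭" (rocket), "長榮" (Eva), "菲律賓" (Philippines), and "過期" (expired). We find that the method can help users distinguish the senses of a term or its related topical contexts, and that it can be applied to assist event tracking or word sense analysis.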
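    The ranking of discriminative words by the product of term frequency and inverse document frequency can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the record does not define the "modified" inverse document frequency, so the code substitutes a smoothed idf computed over clusters, and the function name `rank_discriminative_words` and the toy data are hypothetical.

```python
import math
from collections import Counter


def rank_discriminative_words(clusters, top_n=3):
    """Rank the context words of each cluster by tf * idf.

    clusters: dict mapping a cluster id to a list of context patterns,
              where each pattern is a list of words.
    ASSUMPTION: idf is log((N + 1) / df) with df counted over clusters;
    the thesis's "modified" idf is not specified in this record.
    """
    n_clusters = len(clusters)
    # df[w] = number of clusters in which word w occurs at least once.
    df = Counter()
    for patterns in clusters.values():
        df.update({w for pattern in patterns for w in pattern})

    ranked = {}
    for cid, patterns in clusters.items():
        # tf[w] = number of occurrences of w inside this cluster.
        tf = Counter(w for pattern in patterns for w in pattern)
        scores = {w: tf[w] * math.log((n_clusters + 1) / df[w]) for w in tf}
        # Sort by descending score; break ties alphabetically for stability.
        ranked[cid] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return ranked


# Hypothetical toy context patterns for two senses of "apple".
toy_clusters = {
    "fruit": [["eat", "sweet", "fruit"], ["fruit", "juice", "sweet"]],
    "company": [["iphone", "company", "stock"], ["company", "iphone", "sweet"]],
}
print(rank_discriminative_words(toy_clusters, top_n=2))
```

    In the toy data, "sweet" occurs in both clusters, so its idf is low and it ranks below words that occur in only one cluster, which is the behaviour expected of a discriminative ranking.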


    The aim of this research is to cluster the context patterns of a query and to resolve the relevant topics of documents according to the most discriminative context words of the clusters. First, context patterns are extracted from documents and mapped into a feature space by latent semantic indexing, in which pattern clustering is conducted. The clusters are then refined by removing the patterns close to the cluster boundaries, and the words of the remaining patterns within each cluster are extracted and sorted to obtain the most discriminative context words. These words are further used to re-rank the documents, and the purity of the top 20 documents is evaluated. Experimental results based on the context patterns of the query “apple” show that GMM-based clustering achieves significantly better clustering performance (F-measure) than K-means or agglomerative clustering, since GMM is based on a parametric model that has better generalization capability and takes into account distance normalization by covariance. In addition, adding the discriminative context words to the query individually improves the retrieval performance (purity of the top 20 documents) from 0.85 to 0.96, and cluster refinement further improves the purity to 0.973. When the proposed scheme was used to analyze the clusters of context patterns for queries such as “rocket”, “Eva”, “expired” and “Philippines”, the relevant topics could be resolved effectively, which may help to induce word senses or track the events of various topics.
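    The retrieval measure used above, the purity of the top 20 documents, can be illustrated with a short sketch. It assumes purity means the fraction of the top-ranked documents that belong to the intended sense of the query. The averaging over several refined queries is also an assumption, suggested only by the fact that the words are added to the query individually. The function names and labels are hypothetical.

```python
def purity_at_k(ranked_labels, target, k=20):
    """Fraction of the top-k ranked documents whose label equals target.

    ranked_labels: sense/topic labels of the documents, in ranked order.
    """
    top = ranked_labels[:k]
    if not top:
        raise ValueError("ranked_labels must not be empty")
    return sum(1 for label in top if label == target) / len(top)


def mean_purity(runs, k=20):
    """Average purity over several (ranked_labels, target) runs.

    ASSUMPTION: one run per refined query (query term plus one
    discriminative context word); the averaging scheme is illustrative.
    """
    return sum(purity_at_k(labels, target, k) for labels, target in runs) / len(runs)


# Hypothetical baseline ranking: 17 of the top 20 documents match the
# intended sense, which corresponds to a purity of 0.85.
baseline = ["fruit"] * 17 + ["company"] * 3
print(purity_at_k(baseline, "fruit"))
```

    Under this definition a single top-20 ranking can only yield multiples of 0.05, so a baseline of 0.85 corresponds to 17 matching documents out of 20.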

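    Chapter 1. Introduction
        1.1 Research Motivation
        1.2 Main Contributions
        1.3 Thesis Organization
    Chapter 2. Literature and Technical Background
        2.1 Vector Space Model
        2.2 Latent Semantic Indexing
        2.3 k-means Clustering Algorithm
        2.4 Hierarchical Clustering Algorithm
        2.5 Gaussian Mixture Model Clustering Algorithm
        2.6 F-measure Cluster Evaluation Metric
        2.7 Literature on Context Information for Retrieval
        2.9 Chapter Summary
    Chapter 3. Construction and Clustering Evaluation of Context Patterns
        3.1 Document Preprocessing
        3.2 Construction of Context Patterns
        3.3 Clustering of Context Patterns
        3.4 Chapter Summary
    Chapter 4. Discriminative Word Extraction and Retrieval Improvement
        4.1 Refinement of Context Pattern Clusters
        4.2 Discriminative Word Extraction
        4.3 Retrieval Performance Evaluation
        4.4 Experiments and Analysis
        4.5 The Case of Unbalanced Context Pattern Classes
        4.5 Chapter Summary
    Chapter 5. Context Clustering Applied to Topic Analysis
        5.1 Analysis of the Term "火箭" (rocket)
        5.2 Analysis of the Term "長榮" (Eva)
        5.3 Analysis of the Term "菲律賓" (Philippines)
        5.4 Analysis of the Term "過期" (expired)
        5.5 Chapter Summary
    Chapter 6. Conclusion
    References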

