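Context Pattern Clustering Applied to Query Refinement for Text Retrieval | National Taiwan University of Science and Technology Electronic Theses and Dissertations System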


Graduate student:	紀至鍇 Zhi-kai JI
Thesis title:	應用語境樣式分群於文件檢索中的查詢精煉之研究 (Context Pattern Clustering Applied to Query Refinement for Text Retrieval)
Advisor:	林伯慎 Bor-Shen Lin
Committee members:	楊傳凱 Chuan-Kai Yang, 古鴻炎 Hung-yan Gu
Degree:	Master
Department:	College of Management - Department of Information Management
Year of publication:	2013
Academic year of graduation:	102
Language:	Chinese
Number of pages:	60
Keywords (Chinese):	語境樣式、語境分群、詞義分析、主題分析、查詢擴展、查詢精煉、文件檢索
Keywords (English):	context pattern, context clustering, word sense analysis, topic analysis, query expansion, query refinement, document retrieval


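The main goal of this research is to use context pattern clustering to find the context clusters that cover the main senses of a query term, and then to find, within each cluster, the context words that are most discriminative for that query term. These words can be used to refine the query, with the aim of increasing the precision of document retrieval, or to assist word sense analysis of the query term. We first used a document collection related to "蘋果" (apple) for a baseline retrieval experiment: with the query term alone, the purity of the top 20 documents is 0.85. The context pattern clustering experiments show that GMM clustering achieves a higher F-measure than k-means and agglomerative clustering. In addition, the discriminative words taken from the unrefined and the refined clusters reach retrieval purities of 0.96 and 0.973, respectively, when the context pattern classes are balanced in size; when the classes are unbalanced, they still reach 0.93 and 0.973, respectively. This indicates that cluster refinement helps in finding discriminative context words. Among the criteria for ranking discriminative words, the product of term frequency and a modified inverse document frequency gives the best retrieval performance. Finally, we carried out context clustering and cluster sense analysis for the terms "火箭" (rocket), "長榮" (Eva), "菲律賓" (Philippines), and "過期" (expired). We found that the method can help users distinguish the senses of a term or its related topical contexts, and that it can be applied to assist event tracking or word sense resolution.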
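The two quantities named in the abstract above, the purity of the top 20 documents and a term-frequency times inverse-document-frequency ranking of context words, can be illustrated with a minimal sketch. The function names, the toy clusters, and the smoothed IDF formula are all assumptions made for illustration; the abstract does not define the thesis's "modified" inverse document frequency, so this is not the author's scoring function.

```python
import math
from collections import Counter


def purity_at_k(ranked_labels, target, k=20):
    """Fraction of the top-k retrieved documents whose label equals `target`."""
    top = ranked_labels[:k]
    return sum(1 for label in top if label == target) / len(top) if top else 0.0


def rank_context_words(clusters):
    """Rank the words of each cluster by tf * idf, with idf taken over clusters.

    clusters: dict mapping a cluster name to a list of context patterns,
    each pattern being a list of words.
    ASSUMPTION: the smoothed idf, log((1 + N) / (1 + df)) + 1, is a stand-in
    for the thesis's modified IDF, which the abstract does not specify.
    """
    n_clusters = len(clusters)
    tf = {name: Counter(w for pattern in patterns for w in pattern)
          for name, patterns in clusters.items()}
    # df[w] = number of clusters in which word w occurs at least once
    df = Counter(w for counts in tf.values() for w in counts)
    ranked = {}
    for name, counts in tf.items():
        scores = {w: f * (math.log((1 + n_clusters) / (1 + df[w])) + 1)
                  for w, f in counts.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked[name] = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked


# Toy data (hypothetical, not taken from the thesis).
toy_clusters = {
    "fruit": [["eat", "sweet", "fruit"], ["fruit", "juice", "eat"]],
    "company": [["iphone", "company", "eat"], ["company", "stock"]],
}
print(rank_context_words(toy_clusters))
# A made-up ranking in which 17 of the top 20 documents carry the target label.
print(purity_at_k(["fruit"] * 17 + ["company"] * 3, "fruit"))
```

In this toy setting a word such as "eat", which occurs in both clusters, is pushed down the ranking, while words confined to one cluster rise to the top; those are the candidates one would append to the query.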

The aim of this research is to cluster the context patterns of a query and to resolve the relevant topics of documents according to the most discriminative context words of the clusters. First, context patterns are extracted from documents and mapped into a feature space by latent semantic indexing, in which pattern clustering is conducted. The clusters are then refined by removing the patterns close to the cluster boundaries, and the words of the remaining patterns within each cluster are extracted and sorted to obtain the most discriminative context words. These words are then used to re-rank the documents, and the purity of the top 20 documents is evaluated. Experimental results based on the context patterns of the query “apple” show that GMM-based clustering achieves significantly better clustering performance (F-measure) than k-means or agglomerative clustering, since GMM is based on a parametric model that generalizes better and normalizes distances by the covariance. In addition, adding the discriminative context words to the query individually improves the retrieval performance (purity of the top 20 documents) from 0.85 to 0.96, and cluster refinement further improves the purity to 0.973. When the proposed scheme was used to analyze the clusters of context patterns for queries such as “rocket”, “Eva”, “expired” and “Philippines”, the relevant topics could be resolved effectively, which may help to induce word senses or to track the events of various topics.
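Two steps of the pipeline described above, the latent semantic indexing projection and the removal of patterns near cluster boundaries, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input matrix is toy data, and treating a pattern as "close to the boundary" when its largest cluster posterior falls below a threshold is only one plausible reading, since the abstract does not say how closeness is measured.

```python
import numpy as np


def lsi_project(term_pattern, rank):
    """Map context patterns into a rank-dimensional latent space.

    term_pattern: (n_terms, n_patterns) weight matrix, one column per pattern.
    Returns an (n_patterns, rank) array of latent coordinates obtained from
    the truncated singular value decomposition.
    """
    _, s, vt = np.linalg.svd(term_pattern, full_matrices=False)
    return (np.diag(s[:rank]) @ vt[:rank]).T


def refine_clusters(posteriors, threshold=0.9):
    """Keep only patterns that are confidently assigned to one cluster.

    posteriors: (n_patterns, n_clusters) membership probabilities, for example
    the posteriors of a fitted Gaussian mixture model.
    ASSUMPTION: a pattern counts as a boundary pattern when its largest
    posterior is below `threshold`; the thesis may use a different criterion.
    Returns (indices of kept patterns, their cluster labels).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    labels = posteriors.argmax(axis=1)
    keep = np.flatnonzero(posteriors.max(axis=1) >= threshold)
    return keep, labels[keep]


# Toy term-by-pattern matrix (hypothetical): 4 terms, 5 context patterns.
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 1],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)
Z = lsi_project(X, rank=2)   # coordinates on which a clusterer would be run
print(Z.shape)

# Made-up posteriors for four patterns and two clusters.
keep, labels = refine_clusters([[0.98, 0.02], [0.55, 0.45],
                                [0.05, 0.95], [0.40, 0.60]])
print(keep, labels)
```

The clustering step itself is left out; any of the three methods compared in the abstract could be applied to the rows of `Z`, and only a model that yields membership probabilities, such as a GMM, would feed `refine_clusters` directly.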

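Chapter 1. Introduction	1
1 Research Motivation	1
2 Main Contributions	2
3 Thesis Organization	3
Chapter 2. Literature and Technical Background	4
1 Vector Space Model	4
2 Latent Semantic Indexing	6
3 k-means Clustering Algorithm	8
4 Hierarchical Clustering Algorithm	9
5 Gaussian Mixture Model Clustering Algorithm	10
6 F-measure for Cluster Evaluation	11
7 Literature on Context Information in Retrieval	13
9 Chapter Summary	13
Chapter 3. Construction and Clustering Evaluation of Context Patterns	14
1 Document Preprocessing	14
2 Construction of Context Patterns	17
3 Clustering of Context Patterns	19
4 Chapter Summary	23
Chapter 4. Discriminative Word Extraction and Retrieval Improvement	24
1 Refinement of Context Pattern Clusters	24
2 Discriminative Word Extraction	27
3 Retrieval Performance Evaluation	28
4 Experiments and Analysis	30
5 The Case of Unbalanced Context Pattern Classes	32
5 Chapter Summary	34
Chapter 5. Context Clustering Applied to Topic Analysis	35
1 Analysis of the Term "火箭" (rocket)	35
2 Analysis of the Term "長榮" (Eva)	38
3 Analysis of the Term "菲律賓" (Philippines)	40
4 Analysis of the Term "過期" (expired)	43
5 Chapter Summary	46
Chapter 6. Conclusion	47
References	48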

                                

