Graduate student: | 劉易昇 YI-SHENG LOU |
---|---|
Thesis title: | 基於PageRank之文件分群與文件視覺化方法研究 (Document Clustering and Visualization of Documents Based on PageRank) |
Advisor: | 林伯慎 Bor-Shen Lin |
Oral defense committee: | 古鴻炎 Hung-yan Gu, 楊傳凱 Chuan-Kai Yang |
Degree: | Master |
Department: | School of Management - Department of Information Management |
Year of publication: | 2014 |
Academic year of graduation: | 102 (ROC calendar) |
Language: | Chinese |
Number of pages: | 118 |
Keywords (Chinese): | PageRank分群, 文件分群, 文件視覺化 |
Keywords (English): | PageRank Based Clustering, Document Clustering, Visualization of Documents |
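This study proposes a PageRank-based method for document clustering and document visualization. The method can be used to analyze a document collection so that readers can grasp its main characteristics and general content more quickly. Through a zoom-in concept, a cluster of interest can be examined more closely and clustered again at a finer level, which helps uncover the important threads within the collection. We also propose two evaluation metrics, compactness and connectivity, to measure clustering quality. Compactness is the average similarity over all pairs of data points in a cluster, while connectivity is the average connectivity along the minimum spanning tree of the data in a cluster. The connectivity metric helps identify clusters that extend over a larger region of the space yet remain linked to one another; such clusters usually have somewhat lower compactness. In clustering experiments on different types of document collections, we found that the PageRank-based clustering method achieves better connectivity than the K-Means clustering method. Finally, the contextual information contained in the clusters produced by PageRank clustering, such as the weights of data points and the links between them, is used to support visualization of the clustered documents.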
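The two metrics described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several choices here are assumptions: cosine similarity as the similarity measure, a spanning tree built with Prim's algorithm that always keeps the most similar edge (equivalent to a minimum spanning tree over the distance 1 − similarity), and a value of 1.0 for single-document clusters. The function names `cosine`, `compactness`, and `connectivity` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two term-weight vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compactness(vectors):
    """Average similarity over all pairs of documents in one cluster."""
    pairs = list(combinations(range(len(vectors)), 2))
    if not pairs:
        return 1.0  # assumption: a single document is trivially compact
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def connectivity(vectors):
    """Average similarity along the spanning tree of one cluster.

    Assumption: the tree is the minimum spanning tree over the distance
    1 - similarity, built here with Prim's algorithm.
    """
    n = len(vectors)
    if n < 2:
        return 1.0  # assumption: a single document is trivially connected
    # best[j] = highest similarity between node j and any node already in the tree
    best = {j: cosine(vectors[0], vectors[j]) for j in range(1, n)}
    edge_sims = []
    while best:
        j = max(best, key=best.get)      # closest node outside the tree
        edge_sims.append(best.pop(j))    # its connecting edge joins the tree
        for k in best:
            best[k] = max(best[k], cosine(vectors[j], vectors[k]))
    return sum(edge_sims) / len(edge_sims)


# A chain-shaped cluster: four unit vectors spaced 30 degrees apart.
chain = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
         for a in (0, 30, 60, 90)]
print(round(compactness(chain), 3), round(connectivity(chain), 3))
```

For this chain-shaped cluster the compactness is about 0.60 while the connectivity is about 0.87, which matches the abstract's remark that elongated but linked clusters tend to score lower on compactness than on connectivity.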
In this paper, we propose a document clustering and visualization scheme built on PageRank-based agglomerative clustering. The approach can be used to analyze document sets so that people can quickly grasp the main topics or issues within a document set. In addition, two metrics, compactness and connectivity, are defined to measure the quality of document clusters. Experimental results show that the PageRank-based approach outperforms the k-means-based approach on both metrics by aggregating data strictly and eliminating outliers effectively. The scheme has been initially tested on several document sets, with satisfactory analysis results. A visualization of 1,000 sports news articles based on this scheme is also given to show its applicability.
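As background for the PageRank-based weighting mentioned above, the following is a minimal sketch of PageRank computed by power iteration over a document similarity matrix. It is not the thesis's algorithm: the abstract does not say how the graph is built or how clusters are merged, so the similarity-weighted graph, the damping factor of 0.85, and the function name `pagerank` are all assumptions made for illustration.

```python
def pagerank(sim, damping=0.85, max_iters=200, tol=1e-12):
    """PageRank scores for documents linked by pairwise similarity.

    sim is a square matrix (list of lists) of non-negative similarities.
    Assumption: edges are weighted by similarity and self-links are ignored.
    """
    n = len(sim)
    # Turn each row of similarities into transition probabilities.
    trans = []
    for i in range(n):
        row = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        # A document with no links jumps uniformly to any document.
        trans.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(max_iters):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical similarities: document 1 is close to both other documents.
similarities = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
scores = pagerank(similarities)
print([round(s, 3) for s in scores])
```

In this toy example the middle document receives the highest score because it is strongly similar to both neighbors, which is the kind of per-document weight that a visualization could use to emphasize central documents.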