
Author: 劉易昇
YI-SHENG LOU
Thesis title: 基於PageRank之文件分群與文件視覺化方法研究
Document Clustering and Visualization of Documents Based on PageRank
Advisor: 林伯慎
Bor-Shen Lin
Committee members: 古鴻炎
Hung-yan Gu
楊傳凱
Chuan-Kai Yang
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2014
Graduation academic year: 102 (ROC calendar)
Language: Chinese
Number of pages: 118
Chinese keywords: PageRank分群, 文件分群, 文件視覺化
English keywords: PageRank Based Clustering, Document Clustering, Visualization of Documents
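  • This study proposes a PageRank-based method for document clustering and document visualization. It can be used to analyze a document set so that readers grasp its main characteristics and general content more quickly. Through a zoom-in concept, a reader can also look more closely at one cluster and cluster its documents again at a finer level, in order to find the important threads running through the document set. We also propose two evaluation metrics, compactness and connectivity, to measure clustering quality. Compactness measures the average pairwise similarity between data points within a cluster, while connectivity measures the average link strength along the minimum spanning tree of the data within a cluster. The connectivity metric helps identify clusters whose data extend over a larger region of the space yet remain linked to one another; such clusters usually have somewhat lower compactness. We ran document clustering experiments on different types of document sets and found that the PageRank document clustering method achieves better connectivity than the K-Means document clustering method. Finally, the contextual information contained in the clusters produced by PageRank clustering, such as the weights of data points and the links between data points, is used to support visualization of the document clusters.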
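
    The two metrics described above can be sketched as follows. This is a minimal illustration rather than the thesis's own definition: it assumes cosine similarity between document vectors, and it assumes the "minimum spanning tree" is taken over distances, which is equivalent to the spanning tree that maximizes total similarity. The function names `cosine`, `compactness`, and `connectivity` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compactness(cluster):
    """Average similarity over all pairs of points in the cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # assumption: a singleton cluster is treated as fully compact
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def connectivity(cluster):
    """Average similarity along the spanning tree that maximizes similarity.

    Assumption: this stands in for the minimum spanning tree over distances.
    The tree is built with Prim's algorithm, starting from point 0.
    """
    n = len(cluster)
    if n < 2:
        return 1.0
    in_tree = {0}
    # best[k] = highest similarity between point k and any point already in the tree
    best = [cosine(cluster[0], cluster[k]) for k in range(n)]
    total = 0.0
    while len(in_tree) < n:
        j = max((k for k in range(n) if k not in in_tree), key=lambda k: best[k])
        total += best[j]
        in_tree.add(j)
        for k in range(n):
            if k not in in_tree:
                best[k] = max(best[k], cosine(cluster[j], cluster[k]))
    return total / (n - 1)


# An elongated "chain" cluster: unit vectors spaced 10 degrees apart.
chain = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
         for a in range(0, 90, 10)]
print(compactness(chain), connectivity(chain))
```

    For the chain above, every spanning-tree edge joins neighbours 10 degrees apart, so connectivity stays high while compactness is pulled down by the distant pairs. This matches the abstract's remark that extended but linked clusters tend to have somewhat lower compactness.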


    In this paper, we propose a document clustering and visualization scheme using PageRank-based agglomerative clustering. This approach can be used to analyze document sets so that people can quickly grasp the main topics or issues within a document set. In addition, two metrics, compactness and connectivity, are defined to measure the quality of document clusters. Experimental results show that the PageRank-based approach outperforms the k-means-based approach on both metrics by aggregating data strictly and eliminating outliers effectively. The scheme has been preliminarily tested on several document sets, and satisfactory analysis results were obtained. A visualization of 1,000 sports news articles based on this scheme is further given to show its applicability.
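
    As a rough illustration of how PageRank can assign weights to documents without hyperlinks, the sketch below runs the standard power iteration on a weighted document similarity graph. It is not the thesis's agglomerative clustering procedure, which the abstract does not spell out. The function name `pagerank_weights`, the damping factor of 0.85, and the toy similarity matrix are all assumptions made for the example.

```python
def pagerank_weights(sim, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank scores for a graph given as a square similarity matrix.

    sim[i][j] is the (non-negative) edge weight from document i to j;
    diagonal entries are ignored. Scores sum to 1.
    """
    n = len(sim)
    # Total outgoing weight of each node, excluding self-loops.
    out = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out[i] == 0:
                # Dangling node: spread its score evenly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if j != i:
                        new[j] += damping * rank[i] * sim[i][j] / out[i]
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank


# Toy data (hypothetical): documents 0-2 are mutually similar, document 3 is an outlier.
toy_sim = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.05],
    [0.1, 0.1, 0.05, 0.0],
]
toy_scores = pagerank_weights(toy_sim)
print(toy_scores)
```

    In this toy graph the weakly linked document receives the lowest score, which is one plausible way an outlier could be identified and set aside before clusters are formed.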

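    Chapter 1 Introduction
      1.1 Research Motivation
      1.2 Thesis Objectives and Summary of Results
      1.3 Thesis Organization and Structure
    Chapter 2 Literature and Technical Background
      2.1 The PageRank Algorithm
      2.2 Document Processing Techniques
        2.2.1 Vector Space Model
        2.2.2 Principal Component Analysis (PCA)
      2.3 Data Clustering Methods
      2.4 Literature on PageRank for Data Clustering
      2.5 Literature on PageRank for Document Processing
      2.6 Literature on Document Visualization
      2.7 Chapter Summary
    Chapter 3 PageRank Clustering
      3.1 PageRank-Based Clustering Method
        3.1.1 Method Description
        3.1.2 Characteristics of PageRank Data Clustering
      3.2 Evaluation Methods
        3.2.1 Compactness
        3.2.2 Experimental Analysis of Compactness
        3.2.3 Connectivity
        3.2.4 Experimental Analysis of Connectivity
        3.2.5 Comparison of Connectivity and Compactness
      3.3 Chapter Summary
    Chapter 4 PageRank Document Clustering Method
      4.1 PageRank Document Clustering Method
        4.1.1 Text Processing
        4.1.2 Description of PageRank Document Clustering
      4.2 Effect of Feature Dimensionality on Clustering
        4.2.1 PCA Dimension Variance Experiment
        4.2.2 Effect of PCA Dimensionality on the Document Similarity Distribution
        4.2.3 Effect of the Cosine Similarity Distribution on Clustering
      4.3 Document Clustering Evaluation Experiments
        4.3.1 Document Set (I) Experiment
        4.3.2 Document Set (II) Experiment
        4.3.3 Document Set (III) Experiment
        4.3.4 Document Set (IV) Experiment
        4.3.5 Document Set (V) Experiment
      4.4 Chapter Summary
    Chapter 5 Hierarchical Visualization of PageRank Document Clustering
      5.1 Concept
      5.2 PageRank Document Clustering Visualization Method
      5.3 Analysis of Sports News
        5.3.1 Document Set of 1,000 Chinese Sports News Articles
        5.3.2 Finer Clustering Within the Baseball Cluster
        5.3.3 Finer Clustering Within the Chinese Professional Baseball League Cluster
      5.4 Chapter Summary
    Chapter 6 Conclusion
    References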

