
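Graduate student: 廖若筑 (Jo-Chu Liao)
Thesis title: 結合本體論與封閉高頻項目集之階層式文件分群法
A Method of Combining Ontology and Closed Frequent Itemsets for Hierarchical Document Clustering
Advisor: 徐俊傑 (Chiun-Chieh Hsu)
Committee members: 黃世禎 (Shih-Chen Huang), 張錫正 (Hsi-Cheng Chang)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 86
Chinese keywords (translated): FIHC, association rule mining, clustering algorithm, ontology, closed frequent itemsets
English keywords: FIHC, Documents clustering, Closed frequent itemsets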
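  • Advances in information technology and the spread of the Internet have
    flooded daily life with data, bringing with them the problem of information
    overload. Obtaining the needed information quickly and accurately has become
    an important and difficult issue. Text mining aims to turn seemingly useless
    data into valuable knowledge, and the motivation of this research is to use
    text mining techniques to help users obtain the information that interests
    them. Automatic document clustering is a popular research topic in text
    mining: a clustering algorithm groups documents by topic, placing similar
    documents in the same cluster so that users can browse and collect
    information easily. However, when traditional clustering techniques are
    applied to text mining, they must overcome several difficulties, such as the
    high dimensionality and large volume of document collections and clustering
    results that are hard to read.
    To address the shortcomings of traditional clustering methods, this research
    improves the FIHC (Frequent Itemset-based Hierarchical Clustering) algorithm,
    which was developed for the needs of text mining. FIHC uses the frequent
    itemsets found by association rule mining as the basis for clustering; the
    idea is that articles on the same topic share many keywords. Frequent
    itemsets are therefore treated as shared keywords, and the goal of clustering
    is to find the articles that share them. This research further combines FIHC
    with an ontology to resolve the problem of synonymy (several words expressing
    one meaning) and to uncover concepts latent in the articles, thereby
    improving clustering accuracy. In addition, this research replaces the
    frequent itemsets used by FIHC with closed frequent itemsets, which improves
    the efficiency of the algorithm. Experiments confirm that the clustering
    results of the proposed method are more accurate than those of most
    clustering methods.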
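    The ontology step described above (synonym replacement followed by concept term addition) can be pictured with a minimal Python sketch. The dictionaries `SYNONYMS` and `CONCEPTS` and the function `normalize` are hypothetical stand-ins invented for illustration; this record does not describe the ontology or the exact procedure used in the thesis.

```python
# Hypothetical toy ontology; the ontology actually used in the thesis is not given here.
SYNONYMS = {"automobile": "car", "auto": "car", "notebook": "laptop"}  # variant -> canonical term
CONCEPTS = {"car": "vehicle", "laptop": "computer"}                    # term -> broader concept


def normalize(tokens):
    """Replace synonyms with one canonical term, then add each term's broader concept."""
    canonical = [SYNONYMS.get(t, t) for t in tokens]
    concepts = [CONCEPTS[t] for t in canonical if t in CONCEPTS]
    # Itemset mining treats a document as a set of terms, so duplicates are dropped.
    return sorted(set(canonical + concepts))


print(normalize(["automobile", "auto", "engine"]))
```

    After this step, two documents that use different words for the same thing share the same item, and the added concept term gives them a further common item at a more general level.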


    Due to advances in science and technology and the popularity of the Internet, the explosion of information has caused an information overload problem. To address this problem, text mining has become increasingly important, and clustering is an active topic in text mining. However, many document clustering methods are modifications of traditional clustering algorithms that were originally designed for relational databases; these algorithms become impractical in real-world document clustering, which requires special handling for high dimensionality, high volume, and ease of browsing.
    FIHC is a hierarchical clustering method developed for document clustering. The intuition behind FIHC is that each cluster has some common words. FIHC uses such words to cluster documents and build a hierarchical topic tree. In this thesis, we combine the FIHC algorithm with an ontology to address the semantic problem and mine the meaning behind the words in documents. Furthermore, we use closed frequent itemsets instead of all frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than well-known document clustering algorithms.
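    To make the role of closed frequent itemsets concrete, here is a minimal Python sketch under stated assumptions. It enumerates frequent itemsets by brute force (not Apriori or FP-Growth), keeps only the closed ones, and builds one initial, possibly overlapping, cluster per itemset. The function names and the toy documents are hypothetical, and the sketch omits the later FIHC stages such as removing overlap between clusters and building the topic tree.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Return {itemset: support count} for itemsets found in at least min_support docs.

    Brute-force enumeration, suitable only for tiny illustrative inputs.
    """
    docs = [frozenset(d) for d in docs]
    vocab = sorted(set().union(*docs))
    result = {}
    for size in range(1, len(vocab) + 1):
        found = False
        for combo in combinations(vocab, size):
            items = frozenset(combo)
            count = sum(1 for d in docs if items <= d)
            if count >= min_support:
                result[items] = count
                found = True
        if not found:
            break  # no frequent itemset of this size, so no larger one can be frequent
    return result


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        s: c for s, c in frequent.items()
        if not any(s < t and c == frequent[t] for t in frequent)
    }


def initial_clusters(docs, itemsets):
    """One cluster per itemset: indices of the docs containing all of its items."""
    return {s: [i for i, d in enumerate(docs) if s <= set(d)] for s in itemsets}


# Hypothetical toy documents, already reduced to sets of keywords.
docs = [
    {"stock", "market", "trade"},
    {"stock", "market"},
    {"stock", "market", "bank"},
    {"game", "team"},
    {"game", "team", "score"},
]
freq = frequent_itemsets(docs, min_support=2)
closed = closed_itemsets(freq)
print(len(freq), len(closed))
print(initial_clusters(docs, closed))
```

    In this toy input, every frequent single keyword always co-occurs with a partner keyword, so only the two pairs are closed. Fewer itemsets means fewer candidate clusters to build and compare, which is the kind of efficiency gain the abstract attributes to closed frequent itemsets.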

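    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives and Methods
      1.3 Thesis Organization
    Chapter 2 Literature Review
      2.1 Ontology
        2.1.1 Concepts of Ontology
        2.1.2 Methods of Ontology Construction
      2.2 Association Rule Mining Algorithms
        2.2.1 Types of Association Rule Mining
        2.2.2 Apriori Algorithm
        2.2.3 FP-Growth Algorithm
        2.2.4 Closed Frequent Itemsets and Maximal Frequent Itemsets
      2.3 Clustering Algorithms
        2.3.1 Hierarchical Clustering
        2.3.2 Partitioning Clustering
    Chapter 3 A Hierarchical Document Clustering Method Combining Ontology and Closed Frequent Itemsets
      3.1 System Architecture
      3.2 Document Preprocessing
        3.2.1 Chinese Word Segmentation
        3.2.2 Part-of-Speech Filtering and Stop Word Removal
      3.3 Synonym Replacement and Concept Term Addition
        3.3.1 Synonym Replacement
        3.3.2 Concept Term Addition
      3.4 Association Rule Mining
        3.4.1 Global Frequent Itemsets
        3.4.2 Document Vectors and Feature Vectors
        3.4.3 Generating Global Frequent Itemsets
        3.4.4 Generating Closed Frequent Itemsets
      3.5 Document Clustering
        3.5.1 Constructing Initial Clusters
        3.5.2 Eliminating Overlap Between Clusters
      3.6 Presentation of Document Clustering Results
    Chapter 4 Experimental Results and Analysis
      4.1 Datasets
      4.2 Evaluation of Document Clustering Results
        4.2.1 Evaluation by Aggregating Correct Clustering Results
        4.2.2 F-measure-Based Evaluation
      4.3 Document Clustering Results and Analysis
        4.3.1 Clustering Results
        4.3.2 Cluster Tree
        4.3.3 Accuracy Evaluation
      4.4 Comparison of the Proposed Method with FIHC
        4.4.1 Evaluation of the Cluster Tree
        4.4.2 Evaluation of Clustering Accuracy
      4.5 Comparison of the Proposed Method with Well-Known Clustering Methods
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions and Contributions
      5.2 Directions for Future Work
    References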


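    Full-text release date: 2017/06/26 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: 臺灣博碩士論文系統)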