
Author: 張婷芳
Ting-Fang Chang
Thesis Title: 結合本體論與詞彙鏈群聚之文件分群研究
A Study on Combining Ontology and Lexical Chain Clusters for Document Clustering
Advisor: 徐俊傑
Chiun-chieh Hsu
Committee: 賴源正
Yuan-cheng Lai
洪政煌
Cheng-huang Hung
Degree: Master
Department: College of Management, Department of Information Management
Thesis Publication Year: 2007
Graduation Academic Year: 95 (ROC calendar, 2006-2007)
Language: Chinese
Pages: 62
Keywords (in Chinese): 文件分群, 本體論, 詞彙鏈, 詞彙鏈群聚
Keywords (in other languages): Document Clustering, Ontology, Lexical Chain, Lexical Chain Clusters
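  • As the Internet has become widespread, the information available has grown increasingly diverse, but this has also brought the problem of information overload. Obtaining the needed information from the Internet quickly and accurately has therefore become an important issue. Helping users quickly find the information they actually need is the purpose of a document clustering system and the motivation for this study.
    Lexical chains are a commonly used method for analyzing documents. They are built with the help of a dictionary, but their accuracy is not ideal; this study argues that this shortcoming can be remedied with a domain-specific ontology. This thesis therefore proposes a document clustering method that "combines ontology and lexical chain clusters". The ontology is used to select candidate words and build lexical chains, and a lexical chain division algorithm is used to cluster the lexical chains. Documents are clustered by computing chain density vectors from the lexical chain clusters, and the chain density vectors are also used to measure the similarity between documents, so that users can browse the documents they need together with similar documents.
    Experimental analysis confirms that using ontology and lexical chain clusters effectively improves the accuracy of document clustering. Clustering based on lexical chain clusters also differs significantly from the k-means and bisecting k-means algorithms.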


    With the growing popularity of the World Wide Web, more and more information is accessible on the Internet. This explosion of information leads to the problem of information overload. Finding the information users are interested in, efficiently and accurately, has become an urgent issue. Document clustering is one approach to retrieving information effectively.
    Lexical chains, which are built with the help of a dictionary, are a method for analyzing text data. Their main defect is low precision. This thesis proposes a document clustering method that combines ontology and lexical chain clusters to compensate for the disadvantage of using lexical chains alone. First, we use the domain ontology to extract candidate words and represent documents with lexical chains. The lexical chains are then clustered with a lexical chain division algorithm, and each document is assigned to one cluster by computing the similarity between the document and the chain density vectors. In this way, documents can be clustered effectively and efficiently.
    Compared with the k-means and bisecting k-means algorithms, the experimental results show that the proposed method significantly improves the precision of document clustering.
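    The pipeline described in this paragraph can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy ontology, the rule of one chain per ontology concept, the hard-coded chain clusters (standing in for the output of the lexical chain division algorithm), and the density definition (share of a document's candidate words falling in each chain cluster) are all assumptions made for illustration, as are the names ONTOLOGY, CHAIN_CLUSTERS, build_chains, chain_density_vector, and assign_cluster.

```python
from collections import defaultdict

# Hypothetical toy domain ontology: word -> concept (an assumption, not from the thesis).
ONTOLOGY = {
    "cpu": "hardware", "memory": "hardware", "disk": "hardware",
    "compiler": "software", "kernel": "software", "driver": "software",
}

# Hypothetical chain clusters, standing in for the result of the
# lexical chain division algorithm, which is not reproduced here.
CHAIN_CLUSTERS = [{"hardware"}, {"software"}]


def build_chains(tokens):
    """Keep only words found in the ontology and chain them by shared concept."""
    chains = defaultdict(list)
    for token in tokens:
        concept = ONTOLOGY.get(token)
        if concept is not None:  # words outside the ontology are not candidates
            chains[concept].append(token)
    return dict(chains)


def chain_density_vector(tokens, clusters=CHAIN_CLUSTERS):
    """Share of the document's candidate words that falls in each chain cluster."""
    chains = build_chains(tokens)
    total = sum(len(words) for words in chains.values())
    if total == 0:
        return [0.0] * len(clusters)
    return [
        sum(len(chains.get(concept, [])) for concept in cluster) / total
        for cluster in clusters
    ]


def assign_cluster(tokens, clusters=CHAIN_CLUSTERS):
    """Assign the document to the chain cluster with the highest density."""
    vector = chain_density_vector(tokens, clusters)
    return max(range(len(vector)), key=lambda i: vector[i])


doc = ["the", "cpu", "and", "memory", "need", "a", "driver"]
print(build_chains(doc))
print(chain_density_vector(doc))
print(assign_cluster(doc))
```

    In this sample, two of the three candidate words belong to the hardware chain, so the document is assigned to cluster 0.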
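    This record does not state how clustering precision was computed. As one common possibility, the sketch below scores a clustering by giving each cluster its majority reference label and counting the documents that match it (often called purity). The function name clustering_precision and the sample data are hypothetical, and the thesis may use a different measure.

```python
from collections import Counter


def clustering_precision(assignments, labels):
    """Fraction of documents whose reference label is the majority label of their cluster."""
    labels_by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        labels_by_cluster.setdefault(cluster_id, []).append(label)
    # For each cluster, count the documents carrying its most frequent label.
    correct = sum(
        Counter(cluster_labels).most_common(1)[0][1]
        for cluster_labels in labels_by_cluster.values()
    )
    return correct / len(labels)


# Hypothetical example: five documents, two clusters, reference labels "a" and "b".
print(clustering_precision([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```

    The same function can score the output of any clustering method, which is what makes a side-by-side comparison with k-means or bisecting k-means possible.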

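    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives and Methods
      1.4 Thesis Organization
    Chapter 2. Literature Review
      2.1 Definition of Lexical Chains
      2.2 Lexical Chain Construction Algorithms and Related Work
        2.2.1 Greedy Chaining Algorithm
        2.2.2 Non-Greedy Chaining Algorithm
      2.3 Overview of Dictionaries
        2.3.1 Roget's
        2.3.2 WordNet
        2.3.3 HowNet (知網)
        2.3.4 Tongyici Cilin (同義詞詞林)
      2.5 Ontology
      2.6 Document Clustering Techniques
        2.6.1 Hierarchical Clustering
        2.6.2 Partitioning Clustering
    Chapter 3. Document Clustering Combining Ontology and Lexical Chain Clusters
      3.1 System Architecture
      3.2 Document Preprocessing
        3.2.1 Chinese Word Segmentation
        3.2.2 Compound Word Detection
        3.2.3 Part-of-Speech Filtering
      3.3 Building Lexical Chains with Ontology
        3.3.1 Domain Ontology
        3.3.2 Document Titles and Candidate Word Selection
        3.3.3 Lexical Chain Construction
      3.4 Lexical Chain Clustering
        3.4.1 Representing Relationships Between Lexical Chains as a Graph
        3.4.2 Clustering Lexical Chains by Lexical Chain Division
      3.5 Document Clustering and Information Presentation
        3.5.1 Document Clustering with Chain Density
        3.5.2 Information Presentation
    Chapter 4. Experimental Results and Analysis
      4.1 Dataset and Evaluation Method
        4.1.1 Dataset
        4.1.2 Measuring Document Clustering Accuracy
      4.2 Analysis of Document Clustering Results
        4.2.1 Effect of the Document Title Weight (Parameter β) on Clustering Accuracy
        4.2.2 Effect of Ontology-Based Candidate Word Filtering on Clustering Accuracy
        4.2.3 Effect of the Word Relations Considered in Chain Construction on Clustering Accuracy
        4.2.4 Effect of Removing the Document Title Weight and Lexical Chain Clustering on Clustering Accuracy
        4.2.5 Clustering Accuracy of Combined Ontology and Lexical Chain Clustering Compared with Other Clustering Methods
      4.3 Analysis of Information Presentation Results
    Chapter 5. Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Research Directions
    References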


    Full text public date: 2012/06/28 (intranet)
    Full text not authorized for publication (Internet)
    Full text not authorized for publication (National Library)