
Graduate Student: Hsi-Cheng Chang (張錫正)
Thesis Title: On the Study of Using Topic Keyword Clusters for Document Clustering (使用主題詞彙群組進行文件分群之研究)
Advisor: Chiun-Chieh Hsu (徐俊傑)
Committee Members: 王有禮, 蕭培墉, 王建民, 陳永昇, 黃天佑, 張瑞雄
Degree: Doctor
Department: College of Management, Department of Information Management
Year of Publication: 2005
Academic Year of Graduation: 93 (2004-2005)
Language: English
Number of Pages: 89
Chinese Keywords: 詞彙分群 (Keyword Clustering), 文件分群 (Document Clustering), 自然語言處理 (Natural Language Processing), 資訊檢索 (Information Retrieval)
English Keywords: Keyword Clustering, Document Clustering, Natural Language Processing, Information Retrieval
    The rapid and comprehensive development of the Internet has made a vast amount of digital data of all kinds available online. While this puts digital data and information within easy reach, it has also created the problem of information overload. How to search for and retrieve the desired information on today's Internet quickly and accurately has become a very important and necessary topic. In research on accurate information retrieval and search, text mining, and knowledge discovery, automatic document classification and clustering is the most fundamental and necessary processing step, and it has attracted widespread interest and study. Automatic document clustering has been widely and effectively applied to the analysis of textual information.
    Clustering means automatically grouping data in an unsupervised manner, without relying on any prepared training documents or category labels supplied by experts. Most previous automatic document clustering algorithms cluster documents by directly measuring document-to-document similarity. Document clustering, however, is a high-dimensional data clustering problem: a single article typically contains hundreds or thousands of terms, and even a medium-sized document collection may contain tens or hundreds of thousands of distinct terms. Such a large vocabulary not only blurs the topics the documents discuss, but also makes the computational complexity of traditional document clustering algorithms grow rapidly with the number of documents. As a result, traditional automatic document clustering methods have not achieved satisfactory clustering performance, and their results vary with the document collection and with the threshold settings used in the clustering algorithm.
    To improve clustering accuracy while effectively resolving the excessive computational complexity of traditional document clustering methods, this dissertation approaches document clustering with a different way of thinking: it first identifies the topic keyword clusters in the document collection and then clusters the documents accordingly, and it proposes two methods to solve two different unsupervised document clustering problems. The first method finds topic keyword clusters by identifying strongly connected components in a weighted directed keyword graph and then clusters the documents, solving the unsupervised non-exclusive document clustering problem. The second method applies a k-nearest-neighbor algorithm and a simple connected-component search on a weighted undirected keyword graph to find keyword clusters, and then clusters the documents according to the identified topic keyword clusters, solving the unsupervised exclusive document clustering problem. Within this method, a topic/event detection scheme is also developed to detect topic/event pairs in the document collection; these are used to extract more discriminative and meaningful keywords for document representation and keyword clustering, reducing the computational complexity of the keyword clustering algorithm. Because the keyword clustering process uses only the more meaningful and discriminative keywords, and the number of such keywords does not grow in proportion to the rapid growth of the number of documents, the documents can be clustered very efficiently, making the methods suitable for large document collections. Experimental results show that the proposed automatic document clustering methods outperform traditional clustering methods in both clustering quality and efficiency.
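The second method's connected-component search over the keyword graph can be sketched as follows. This is an illustrative sketch only, not the dissertation's code: the edge list (assumed to come from k-nearest-neighbor selection over keyword correlation weights) and the overlap-based document assignment are assumptions made for the example.

```python
# Illustrative sketch: cluster keywords by finding connected components in an
# undirected keyword graph, then assign each document to the keyword cluster
# it overlaps most. The edge list is assumed to be precomputed (e.g. from
# k-nearest-neighbor selection over keyword correlation weights).
from collections import defaultdict


def keyword_components(keywords, edges):
    """Union-find over an undirected keyword graph; returns keyword clusters."""
    parent = {k: k for k in keywords}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    groups = defaultdict(set)
    for k in keywords:
        groups[find(k)].add(k)
    return list(groups.values())


def assign_documents(docs, clusters):
    """Assign each document to the keyword cluster with the largest overlap."""
    assignment = {}
    for name, terms in docs.items():
        assignment[name] = max(range(len(clusters)),
                               key=lambda i: len(clusters[i] & set(terms)))
    return assignment
```

Because each document goes to exactly one keyword cluster, the resulting partition is exclusive, matching the problem this method targets.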


    The rapid development of the Internet has resulted in the increased availability of digital documents, and the excessive information available online has caused information overload. For effective information retrieval, automatic document clustering is a key issue in exploratory text data analysis. Clustering is the unsupervised classification of data items into groups, requiring no preparation of training data. The dominant conventional approaches to document clustering are based on measuring the similarity among the documents. However, the native feature space comprises the terms in the documents, which may number tens or hundreds of thousands even for a moderate collection of texts. These numerous features blur the topics of the documents, and the computational complexity of conventional document clustering methods increases rapidly with the number of documents. Consequently, many conventional document clustering methods perform poorly on large document collections, and their clustering results frequently differ depending on the assumptions, threshold settings, and contexts of different text collections.
    This study aims to reduce the computational complexity of keyword comparison in conventional document-to-document similarity-based clustering methods while also increasing clustering accuracy. Hence, reversing the notions and processes of conventional document clustering, this study proposes two unsupervised document clustering methods based on automatic topic keyword clustering. The first method uses weighted directed graphs for keyword clustering to solve the unsupervised non-exclusive document clustering problem: the strongly connected components of the keyword digraph are explored heuristically, and the documents are clustered with the identified keyword clusters. The second method uses weighted undirected graphs for keyword clustering to solve the unsupervised exclusive document clustering problem: a topic/event detection scheme is developed and used to extract the most meaningful keywords for document representation and keyword clustering. This process makes the topic keywords more discriminative and concise than simple feature-filtering metrics alone and, furthermore, lightens the computational cost. Because only discriminative and meaningful topic keywords are used, and their number grows more slowly than the number of documents, the proposed clustering methods significantly reduce computational cost while achieving higher clustering accuracy than other clustering methods. The proposed methods are therefore more suitable than conventional document clustering methods for clustering large document collections.
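The first method's keyword clustering reduces to finding strongly connected components in a directed keyword graph. A minimal sketch, not the dissertation's implementation: the graph construction (correlation weights, thresholds, and the heuristic exploration) is assumed to have already produced the digraph, which here is simply an adjacency map.

```python
# Illustrative sketch: keyword clusters as the strongly connected components
# (SCCs) of a directed keyword graph, computed with Tarjan's algorithm.
# An edge a -> b is assumed to mean keyword a's correlation toward b
# exceeds some threshold; graph construction is outside this sketch.

def strongly_connected_components(graph):
    """Tarjan's SCC algorithm; `graph` maps node -> iterable of successors."""
    index_of, low = {}, {}
    stack, on_stack = [], set()
    sccs, counter = [], [0]

    def strongconnect(v):
        index_of[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index_of:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index_of[w])
        if low[v] == index_of[v]:  # v is the root of an SCC
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            sccs.append(component)

    for v in list(graph):
        if v not in index_of:
            strongconnect(v)
    return sccs
```

Because a document may overlap several strongly connected keyword groups, clustering documents against these components naturally permits non-exclusive (overlapping) membership.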

    CONTENTS
    Chinese Abstract / Abstract / Acknowledgements / Contents / List of Tables / List of Figures
    CHAPTER 1 Introduction
      1.1 Background and Motivation
      1.2 Problems of the Conventional Document Clustering Methods
        1.2.1 Problems of the Centroid-based Clustering Methods
        1.2.2 Problems of Agglomerative Clustering Methods
        1.2.3 Problems of the Nearest Neighbor Clustering Algorithm
        1.2.4 Problems of Document Clustering
      1.3 Organization of the Dissertation
    CHAPTER 2 Document Representation
      2.1 The Vector Space Model
      2.2 Similarity Measures
    CHAPTER 3 Document Clustering Methods
      3.1 Agglomerative Hierarchical Clustering Method
      3.2 Partitional Clustering Methods
        3.2.1 K-means Clustering Algorithm
        3.2.2 Bisecting K-means Clustering Algorithm
      3.3 Nearest-Neighbor Clustering Method
      3.4 Clustering by Graph Theory
    CHAPTER 4 Feature Filtering and Extraction Methods
      4.1 Feature Filtering Methods
      4.2 Grammar Properties of Mandarin Chinese
        4.2.1 Action/Event vs. State/Situation in Chinese Verbs
        4.2.2 Word-order Typology in Chinese
      4.3 Properties of News Writing
      4.4 Event Detection and Keyword Extraction
    CHAPTER 5 Unsupervised Non-Exclusive Document Clustering
      5.1 Semantic Correlation Measurement among Keywords
      5.2 Keyword Clustering with Weighted Directed Graphs
        5.2.1 Compute Support Rate of Keywords
        5.2.2 Automatic Document Clustering
        5.2.3 Experiments
    CHAPTER 6 Unsupervised Exclusive Document Clustering
      6.1 Document Pre-Processing
      6.2 Phrase Identification
      6.3 Feature Selection
      6.4 Keyword Clustering with Weighted Undirected Graphs
      6.5 Identifying Topic Keywords
      6.6 Estimating Semantic Correlation among Topic Keywords
      6.7 Constructing the Topic Keyword Network
      6.8 Topic Keyword Clustering
        6.8.1 Identifying Candidate Topic Keywords
        6.8.2 Finding Candidate Keyword Groups
        6.8.3 Finding the Connected Component of Each Candidate Keyword Group
        6.8.4 Merging Keyword Clusters
        6.8.5 Refining Keyword Clusters
      6.9 Clustering Documents Based on Keyword Clusters
      6.10 Experiments and Discussion
        6.10.1 Experiment One
        6.10.2 Experiment Two
      6.11 Performance Analysis
    CHAPTER 7 Conclusions and Future Work
    REFERENCES


    Full text release date: 2006/06/12 (campus network)
    Full text not authorized for public access (off-campus network)
    Full text release date: 2007/06/02 (National Central Library: Taiwan NDLTD system)