
Author: Li-Fei Wei (魏莉斐)
Thesis title: A Study on Text Categorization Using Combination of Interest Measure and Term Weight (有趣性度量結合詞彙權重之文件分類研究)
Advisor: Chiun-Chieh Hsu (徐俊傑)
Oral defense committee: Cheng-Kang Chen (陳正綱), Sun-Jen Huang (黃世禎)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2007
Graduation academic year: 95 (ROC calendar)
Language: Chinese
Number of pages: 61
Keywords: Text Categorization (文件分類), Correlation (相關性), Association Rule (關聯規則), Term (詞彙)
  • With the rise of the World Wide Web, large numbers of electronic documents are available everywhere on the Internet. Because text categorization makes these documents easier to handle, it has attracted considerable research attention in recent years.
    In data mining, association rule mining is an important research direction. Most association rule research focuses on finding positive association rules. However, many studies point out that negative association rules are as important as positive ones.
    This thesis therefore uses an interest measure to distinguish positive from negative association rules. Although the interest measure is commonly used for text categorization, we find that using it alone is not sufficient. Some studies in the literature use the correlation coefficient to judge the strength of a rule, but the correlation coefficient considers only how often terms co-occur or are jointly absent, not the weights of the terms, even though how frequently a term appears in a category indicates its importance. We therefore combine the interest measure with term weights to sharpen the measured strength of positive and negative association rules, and use this strength to filter the rules. The retained rules are intended to be more meaningful and more representative classification criteria for each category, so that categorization results improve and new documents are classified more accurately.
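    The quantities named in the abstract can be illustrated with a small example. This record does not give the thesis's actual formulas, so the following Python sketch uses the textbook definitions of interest (lift) and the phi correlation coefficient over term presence and absence. The function `weighted_strength`, which scales the correlation by the mean of two term weights, is a hypothetical combination for illustration only and is not the author's correlation-strength definition. The sample documents and weights are also made up.

```python
import math

def contingency(docs, a, b):
    """Count documents by presence/absence of terms a and b.

    Returns (n11, n10, n01, n00): both present, only a, only b, neither.
    """
    n11 = sum(1 for d in docs if a in d and b in d)
    n10 = sum(1 for d in docs if a in d and b not in d)
    n01 = sum(1 for d in docs if a not in d and b in d)
    n00 = len(docs) - n11 - n10 - n01
    return n11, n10, n01, n00

def interest(n11, n10, n01, n00):
    """Interest (lift) = P(a, b) / (P(a) * P(b)).

    A value above 1 suggests positive association, below 1 negative
    association, and exactly 1 independence.
    """
    n = n11 + n10 + n01 + n00
    p_a = (n11 + n10) / n
    p_b = (n11 + n01) / n
    if p_a == 0 or p_b == 0:
        return 0.0  # undefined when a term never occurs; treated as 0 here
    return (n11 / n) / (p_a * p_b)

def phi(n11, n10, n01, n00):
    """Phi correlation coefficient of two binary (present/absent) terms."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def weighted_strength(phi_value, weight_a, weight_b):
    """HYPOTHETICAL combination: correlation scaled by the mean term weight.

    This is an assumption for illustration, not the thesis's formula.
    """
    return phi_value * (weight_a + weight_b) / 2

# Made-up documents, each represented as a set of terms.
DOCS = [
    {"stock", "market"},
    {"stock", "market", "bank"},
    {"stock", "market"},
    {"game", "score"},
    {"game", "score", "stock"},
    {"game", "team"},
]

positive = contingency(DOCS, "stock", "market")
negative = contingency(DOCS, "stock", "game")
```

    In this toy data, `interest(*positive)` is 1.5 and `interest(*negative)` is 0.5, so the first pair is positively associated and the second negatively associated. The phi coefficient carries the same sign, and the hypothetical weighting then lets a rule over highly weighted terms outrank a rule with the same co-occurrence pattern over lightly weighted terms.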

    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    List of Figures
    List of Tables
    Chapter 1. Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives and Methods
      1.4 Thesis Organization
    Chapter 2. Literature Review
      2.1 Association Rules
      2.2 Text Categorization
      2.3 Correlation-Based Association Rules
      2.4 Term Weighting Methods
    Chapter 3. Research Method
      3.1 System Architecture
      3.2 Document Preprocessing
        3.2.1 Chinese Word Segmentation
        3.2.2 Named-Entity Identification
        3.2.3 Compound-Word Detection
        3.2.4 Part-of-Speech Filtering
        3.2.5 Term Frequency Filtering
      3.3 Association Rule Mining
        3.3.1 Apriori_IW (Apriori with Interest and Weight) Algorithm
        3.3.2 Interest Measure
        3.3.3 Term Weight
        3.3.4 Correlation Strength
      3.4 Text Categorization
    Chapter 4. Experimental Results and Analysis
      4.1 Document Data Set
      4.2 Evaluation of Categorization Results
      4.3 Effect of Support and Correlation Strength on Text Categorization
        4.3.1 Effect of Support on Text Categorization
        4.3.2 Effect of Correlation Strength on Text Categorization
      4.4 Accuracy of Text Categorization Results
    Chapter 5. Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Research Directions
    References


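    Full-text release date: 2012/07/03 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)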