
Author: Wei-han Chen (陳威翰)
Thesis Title: Round Robin Bag-of-words Generation for Text Classification (文件分類之輪循式字庫建立法)
Advisor: Yuh-jye Lee (李育杰)
Committee: Hsing-kuo Kenneth Pao (鮑興國), Tien-ruey Hsiang, Bi-ru Dai, Wen-han Huang
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science (電資學院 - 資訊工程系)
Thesis Publication Year: 2007
Graduation Academic Year: 95
Language: English
Pages: 49
Keywords (in Chinese): 文件分類、字庫法、特徵選取、特徵萃取、支撐向量機、潛在語義標引法
Keywords (in other languages): dimension reduction, bag-of-words representation, feature extraction, feature selection, latent semantic indexing, round robin selecting strategy, text classification
Reference times: Clicks: 270, Downloads: 1

In text classification, the bag-of-words representation is the most commonly used way to represent each document, and it gives rise to one of the main characteristics of the task: the data are very high-dimensional and sparse. How to reduce the dimensionality of the data so that learning algorithms perform better remains an active research topic. The two approaches most commonly used to reduce dimensionality, from different angles, are feature selection and feature extraction. Feature extraction finds a subspace of the original data dimensions and projects the data onto that subspace; feature selection ranks the features according to a given feature-scoring criterion and selects from the ranking. In this thesis, our "Round Robin Bag-of-words Generation" combines the ideas of feature selection and feature extraction. It proceeds in two main stages. In the first stage, we use latent semantic indexing, a method widely used in text classification, to extract the main concept of each category, and then rank the features by their corresponding coefficients in forming that concept. In the second stage, we introduce the round-robin idea to select from each category's ranked features. Experimental results show that our method outperforms the two classical feature-scoring criteria, mutual information and the chi-square test, as well as using all features, on four benchmark text collections. We further introduce the round-robin bag-of-words idea into mutual information and the chi-square test. The results demonstrate that our method can be combined with other classical feature-scoring criteria and shows different strengths on different datasets.

In text classification, the bag-of-words representation is a widely used way to represent documents, and it leads to a major characteristic of the task: high-dimensional, sparse data. How to reduce the dimensionality of the data efficiently to achieve better performance in a learning task is an ongoing research topic. Two different approaches, feature extraction and feature selection, reduce the data dimensionality. Feature extraction (dimension reduction) algorithms find an informative subspace of the original space and project the data onto it; feature selection methods select informative features according to a feature evaluation criterion. In this thesis, we propose "Round Robin Bag-of-words Generation (RRBWG)" for text classification, which combines the ideas of dimension reduction and feature selection. RRBWG has two stages. In the first stage, we extract the main concept of each category with the well-known dimension reduction method Latent Semantic Indexing (LSI); since each extracted concept is a linear combination of the features, we then rank the features of each category by their corresponding coefficients in that concept. In the second stage, we select from each category's ranked features with a round-robin selecting strategy to generate the bag-of-words. Comparison results show that RRBWG consistently outperforms classical feature selection with two feature scoring criteria, mutual information (MI) and the chi-square test, as well as using no feature selection at all, on four benchmark text collections. Furthermore, we combine RRBWG with these two feature scoring criteria. Experimental results show that RRBWG can cooperate with various feature scoring criteria, with different strengths on the benchmark datasets.
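The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function names and toy matrices are invented, and NumPy's SVD stands in for the LSI computation over per-category term-document matrices.

```python
# Sketch of the RRBWG idea: rank features per category by their
# coefficients in the leading LSI concept, then pick features in
# round-robin order over the categories. All names and data are
# illustrative assumptions, not the thesis's code.
import numpy as np

def rank_features_by_lsi(X):
    """Rank features of one category by the magnitude of their
    coefficients in the leading concept (first left singular vector
    of the terms-by-documents matrix X)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    concept = np.abs(U[:, 0])          # the concept is a linear
                                       # combination of the features
    return list(np.argsort(-concept))  # feature indices, best first

def round_robin_select(rankings, k):
    """Build a k-feature bag-of-words by cycling over the categories,
    taking each category's next best not-yet-selected feature."""
    selected, pos = [], [0] * len(rankings)
    while len(selected) < k:
        for c, ranking in enumerate(rankings):
            while pos[c] < len(ranking) and ranking[pos[c]] in selected:
                pos[c] += 1            # skip features already taken
            if pos[c] < len(ranking):
                selected.append(ranking[pos[c]])
                pos[c] += 1
            if len(selected) == k:
                break
    return selected

# Toy term-document matrices for two categories (4 terms, 3 documents).
X_pos = np.array([[3., 2., 4.], [0., 1., 0.], [1., 0., 1.], [0., 0., 1.]])
X_neg = np.array([[0., 1., 0.], [2., 3., 2.], [0., 0., 1.], [1., 1., 2.]])
rankings = [rank_features_by_lsi(X_pos), rank_features_by_lsi(X_neg)]
bag = round_robin_select(rankings, k=3)
```

The round-robin stage makes every category contribute its own top-ranked features in turn, so no single category's vocabulary dominates the resulting bag-of-words.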

1 Introduction
  1.1 Text Classification
  1.2 Background
  1.3 Organization of Thesis
2 Feature Selection and Dimension Reduction
  2.1 Feature Selection
    2.1.1 Chi-square Test
    2.1.2 Mutual Information
  2.2 Dimension Reduction
    2.2.1 Principal Component Analysis
    2.2.2 Latent Semantic Indexing via Singular Value Decomposition
  2.3 Feature Selection vs. Dimension Reduction
3 Round Robin Bag-of-Words Generation
  3.1 Motivation
  3.2 Round Robin Bag-of-words Generation
4 Classification Methods
  4.1 Support Vector Machines
  4.2 Smooth Support Vector Machines
  4.3 One-Against-Rest
5 Numerical Results and Comparisons
  5.1 Performance Measures
    5.1.1 Precision and Recall
    5.1.2 F1-Measure and Precision/Recall Break-Even Point
    5.1.3 Macro and Micro Average
  5.2 Dataset Descriptions
  5.3 Standard Text Preprocessing
    5.3.1 Stopwords Removal and Stemming
    5.3.2 Term Weighting
  5.4 Experiments Setting
  5.5 Experimental Results and Comparisons
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
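The performance measures listed under Section 5.1 follow the standard definitions of F1 and its macro and micro averages over categories; a minimal sketch with made-up per-category counts (the counts are illustrative, not results from the thesis):

```python
# Macro- vs micro-averaged F1 over per-category confusion counts
# (tp, fp, fn). The toy counts below are invented for illustration.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0  # precision
    r = tp / (tp + fn) if tp + fn else 0.0  # recall
    return 2 * p * r / (p + r) if p + r else 0.0

counts = [(50, 10, 5), (8, 2, 12)]  # (tp, fp, fn) for two categories

# Macro average: compute F1 per category, then average the scores,
# so every category carries equal weight.
macro_f1 = sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)

# Micro average: pool the raw counts first, then compute one F1,
# so frequent categories carry more weight.
TP = sum(tp for tp, _, _ in counts)
FP = sum(fp for _, fp, _ in counts)
FN = sum(fn for _, _, fn in counts)
micro_f1 = f1(TP, FP, FN)
```

With these counts the micro average is pulled toward the large category while the macro average exposes the weak small category, which is why text classification papers typically report both.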
