
Author: 蔡至泓 (Chih-Hung Tsai)
Thesis Title: 利用平滑支撐向量法架構階層式文件分類器 (Two Stages Text Classification via Smooth Support Vector Machine)
Advisor: 李育杰 (Yuh-Jye Lee)
Committee: 鮑興國 (Hsing-Kuo Pao), 何正信 (Cheng-Seen Ho), 張源俊 (Yuan-Chin Chang), 林智仁 (Chih-Jen Lin)
Degree: 碩士 (Master)
Department: 電資學院 - 資訊工程系 (Department of Computer Science and Information Engineering)
Thesis Publication Year: 2005
Graduation Academic Year: 93
Language: 英文 (English)
Pages: 57
Keywords (in Chinese): χ2-測試法、相互資訊法、特徵選擇、線性縮減集支撐向量法、平滑支撐向量法、文件分類
Keywords (in other languages): χ2-test, mutual information, feature selection, reduced support vector machine, smooth support vector machine, text classification
Text classification has been one of the most popular research topics in recent years. The technique can be applied widely, for example to information retrieval, email filtering, topic-based retrieval, and personalized search engines. In this thesis, we propose a two-stage classifier framework for text classification based on the smooth support vector machine. The framework contains two classifiers. The first is generated by the smooth support vector machine together with b-tuning; its purpose is to raise recall. The second classifier then refines the results of the first. Within this framework we use several feature selection methods to select features, namely mutual information, the χ2-test, and DF-Entropy, and compare their results; DF-Entropy performs best among them. In addition, we apply the linear reduced support vector machine to text classification, replacing the nonlinear kernel matrix in the reduced support vector machine with a linear kernel matrix. The advantages of this approach are that it requires no feature selection and reduces computation time. In our comparisons, the two-stage classifier achieves the best classification results, and the linear reduced support vector machine also performs satisfactorily.
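The two-stage idea described above can be sketched in a few lines. This is only an illustrative toy, not the thesis's trained SSVM classifiers: the weight vectors and intercepts below are made-up values, and b-tuning is shown merely as a shift of the intercept toward a more permissive threshold in stage one.

```python
def linear_decision(w, b, x):
    """Linear classifier rule: accept x when w.x + b >= 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0

def two_stage_classify(docs, w1, b1, w2, b2):
    """Two-stage scheme: stage 1 uses a permissive intercept b1
    (b-tuning) so that few true positives are missed (high recall);
    stage 2 re-checks only the documents stage 1 accepted, pruning
    the false positives the loose threshold let through."""
    candidates = [x for x in docs if linear_decision(w1, b1, x)]
    return [x for x in candidates if linear_decision(w2, b2, x)]
```

With toy vectors, `two_stage_classify([(1.0, 0.0), (-1.0, 0.0), (0.2, 1.0)], (1.0, 0.0), 0.5, (1.0, -1.0), 0.0)` keeps only `(1.0, 0.0)`: stage 1 admits two candidates, and the second classifier rejects one of them.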


    In recent years, text classification has been one of the most popular research topics. The technique can be applied widely, for example to information retrieval, email filtering, topic-based retrieval, and personalized search engines. In this thesis, we propose a two-stage linear classifier framework for text classification based on the smooth support vector machine (SSVM). The framework contains two classifiers. The first is generated by SSVM together with a b-tuning procedure; its main purpose is to increase recall. The second classifier then refines the results of the first. We also try several feature selection methods, mutual information (MI), the χ2-test, and DF-Entropy, to select terms as features, and compare their classification results. It turns out that DF-Entropy performs best among these methods. In addition, we apply the linear reduced support vector machine (RSVM) to text classification; this approach replaces the nonlinear kernel in RSVM with a linear kernel. Its advantages are that it requires no feature selection and reduces computation time. In our comparisons, the two-stage linear classifier achieves the best performance, and the linear RSVM also performs well.
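Of the feature selection methods compared above, the χ2-test is the easiest to sketch. The version below uses the standard two-way contingency form of the statistic for one (term, class) pair; this is an assumption, since the thesis's exact variant is not reproduced here, and the `select_terms` helper and its counts are purely illustrative.

```python
def chi2_score(A, B, C, D):
    """Chi-square statistic from a 2x2 contingency table:
      A: docs in the class that contain the term
      B: docs outside the class that contain the term
      C: docs in the class that lack the term
      D: docs outside the class that lack the term"""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def select_terms(stats, k):
    """Keep the k terms with the highest chi-square scores.
    stats maps each term to its (A, B, C, D) counts."""
    ranked = sorted(stats, key=lambda t: chi2_score(*stats[t]), reverse=True)
    return ranked[:k]
```

A term spread evenly across classes, e.g. counts `(5, 5, 5, 5)`, scores 0 and is discarded, while a term concentrated in one class scores high — which is exactly why χ2-style scores strip out uninformative stopword-like terms before training.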

    1 Introduction
      1.1 Background
      1.2 Challenges
      1.3 Organization of Thesis
    2 Text Classification
      2.1 Classification Problem
        2.1.1 Binary Classification
        2.1.2 Multi-Class Classification
        2.1.3 Multi-Label Classification
      2.2 Preprocessing Procedure
        2.2.1 Stopwords and Stemming
        2.2.2 Bag-of-Words Representation
        2.2.3 Feature Selection
        2.2.4 Term Weighting
      2.3 Our Preprocessing Procedure before Training Classifier
      2.4 Measurement
        2.4.1 F-Measure
        2.4.2 Macro-Averaging and Micro-Averaging
    3 Linear Classifier for Text Classification
      3.1 Conventional Support Vector Machines
      3.2 Smooth Support Vector Machines
      3.3 Reduced Support Vector Machines
        3.3.1 Introduction
        3.3.2 Reduced Support Vector Machines (RSVM) of Linear Kernel Formulation
      3.4 Rocchio
      3.5 One-Against-the-Rest
    4 Two Stages Linear Classifier
      4.1 Motivation
      4.2 Two Stages Linear Classifier
    5 Experiments
      5.1 Dataset Descriptions
      5.2 Experimental Setup
      5.3 Numerical Results and Comparisons
    6 Conclusion

