
Student: Tzu-Yang Huang (黃子洋)
Thesis Title: Automatic Categorized Document Collection for Adaptive Text Classification
Advisor: Yuh-Jye Lee (李育杰)
Committee Members: Yuan-Chin Chang (張源俊)
Su-Yun Huang (陳素雲)
Hsing-Kuo Pao (鮑興國)
Bi-Ru Dai (戴碧如)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2008
Graduation Academic Year: 96 (ROC calendar)
Language: English
Number of Pages: 57
Keywords (Chinese): text classification, concept extraction, smooth support vector machines
Keywords (English): concept extraction, Really Simple Syndication



In the real world, labeled data are often scarce and expensive to obtain. Text classification faces exactly this problem: it demands considerable human effort to read through an article and label it correctly. Several techniques have been developed to cope with this scarcity. We propose instead an approach that collects large amounts of labeled data automatically, continuously, and conveniently.

A structured document format called Really Simple Syndication (RSS) was created to store and transport new articles, and an RSS feed usually sticks to a single topical subject. Owing to this characteristic, we can collect articles from RSS feeds and assign each feed's subject as the class label of its articles. In our work, we chose a number of RSS feeds covering various topics on the Internet and built a web crawler that continuously polls them. The collected articles are stored in a database together with their subjects. In our setup, the system can effortlessly collect thousands of labeled articles in a single day. Furthermore, we attempt to verify that data collected this way are reliable: we use a concept extraction method to extract concept tokens for each class and smooth support vector machines as the classifier to test our dataset. We also apply the classifier to articles collected separately from two additional websites. These experiments yield satisfactory results. Finally, we expect our system to alleviate the shortage of labeled data in text classification.
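The feed-to-label step described above can be sketched with Python's standard library. This is a minimal illustration, not the thesis's actual crawler: the RSS parsing follows the RSS 2.0 layout (one `<channel>` title, many `<item>` entries), while the SQLite table schema and function names are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

def collect_labeled_articles(rss_xml):
    """Parse one RSS 2.0 feed and label every item with the feed's topic.

    The <channel><title> is taken as the class label, following the
    observation that most feeds stick to a single subject.
    """
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    label = channel.findtext("title", default="unknown")
    articles = []
    for item in channel.findall("item"):
        articles.append({
            "label": label,
            "title": item.findtext("title", default=""),
            "text": item.findtext("description", default=""),
        })
    return articles

def store(articles, db_path=":memory:"):
    """Store labeled articles in a small SQLite table (illustrative schema)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS docs (label TEXT, title TEXT, text TEXT)")
    con.executemany("INSERT INTO docs VALUES (:label, :title, :text)", articles)
    con.commit()
    return con
```

A crawler in this spirit would fetch each subscribed feed URL on a schedule, pass the response body to `collect_labeled_articles`, and append the results to the database, so every stored article arrives already tagged with its feed's subject.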

1 Introduction (p. 1)
  1.1 Background (p. 1)
  1.2 Our Main Work (p. 2)
  1.3 Organization of Thesis (p. 5)
2 Framework of Automatic Categorized Document Collection (p. 6)
  2.1 Structured Web Resources (p. 6)
  2.2 Framework of Automatic Categorized Document Collection (p. 9)
3 Text Mining Techniques (p. 13)
  3.1 Text Representation (p. 13)
    3.1.1 Stopwords Removal and Stemming (p. 14)
    3.1.2 Bag-of-Words Representation (p. 14)
  3.2 Term Weighting (p. 15)
  3.3 Feature Selection (p. 16)
    3.3.1 χ²-Test (p. 17)
    3.3.2 Mutual Information (p. 18)
  3.4 Latent Semantic Indexing via Singular Value Decomposition (p. 19)
  3.5 Round Robin Bag-of-Words Generation (p. 20)
4 Classification Method (p. 21)
  4.1 Support Vector Machines (p. 21)
  4.2 Smooth Support Vector Machines (p. 26)
  4.3 Reduced Support Vector Machines (p. 27)
  4.4 One-Against-the-Rest for Multi-class Classification Problems (p. 28)
5 Experiments (p. 30)
  5.1 Performance Measures (p. 31)
    5.1.1 Precision, Recall and F-Measure (p. 31)
    5.1.2 Macro and Micro Averages (p. 32)
  5.2 Dataset Descriptions (p. 33)
  5.3 Experiments Setting (p. 34)
  5.4 Experimental Results (p. 36)
6 Conclusion and Future Work (p. 49)
  6.1 Conclusion (p. 49)
  6.2 Future Work (p. 50)

