
Graduate student: David - Alexandre (許大為)
Thesis title: Meta-heuristic Based Document Clustering using Vector Space Model Modification (使用向量空間模型改善之萬用演算法為基礎之文件分群)
Advisor: Ren-Jieh Kuo (郭人介)
Oral defense committee: Vincent F. Yu (喻奉天), Shi-Woei Lin (林希偉)
Degree: Master
Department: College of Management - Department of Industrial Management
Publication year: 2014
Academic year of graduation: 102
Language: English
Pages: 90
Keywords: Cluster analysis, Document clustering, Meta-heuristic, Vector space model

This study combines a meta-heuristic algorithm with a modified vector space model to improve the performance of document clustering. The proposed modification is applied in the similarity calculation: instead of calculating similarity over the whole document vector space, it looks for the most significant part of each document's vector space and uses that part to obtain the highest similarity between two documents. To optimize its usability, the modification is combined with a meta-heuristic algorithm. The proposed algorithm is compared with the ordinary method on four benchmark data sets: SMS Spam Detection, WebKB, Reuters-8, and Reuters-52. The simulation results indicate that document clustering with the vector space model modification can improve clustering performance.
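The abstract does not say how the "most significant part" of a document vector is chosen, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes, purely for illustration, that the significant part is the k highest-weighted terms of each document and that similarity is cosine similarity; the helper names `top_k_mask` and `partial_cosine`, the value of `k`, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_u * norm_v)


def top_k_mask(vec, k):
    """Indices of the k largest weights (assumed notion of 'most significant part')."""
    order = sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)
    return set(order[:k])


def partial_cosine(u, v, k):
    """Cosine similarity restricted to the union of each document's top-k terms."""
    keep = sorted(top_k_mask(u, k) | top_k_mask(v, k))
    return cosine([u[i] for i in keep], [v[i] for i in keep])


# Toy term-weight vectors (hypothetical data, e.g. TF-IDF weights over six terms).
doc_a = [0.9, 0.8, 0.0, 0.1, 0.0, 0.2]
doc_b = [0.8, 0.9, 0.3, 0.0, 0.2, 0.0]

full = cosine(doc_a, doc_b)                # similarity over the whole vector space
part = partial_cosine(doc_a, doc_b, k=2)   # similarity over the significant part only
print(f"full={full:.4f} partial={part:.4f}")
```

In this toy case the low-weight terms that the two documents do not share lower the full-vector similarity, while the restricted similarity reflects only their dominant shared terms. A meta-heuristic such as the particle swarm optimization or genetic algorithm listed in the table of contents could then search for cluster centroids using a measure of this kind, though the exact coupling used in the thesis is not described on this page.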



ABSTRACT
ACKNOWLEDGEMENT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 INTRODUCTION
  1.1. Research Background
  1.2. Research Objectives
  1.3. Research Scopes and Constraints
  1.4. Thesis Organization
Chapter 2 LITERATURE SURVEY
  2.1. Text Document Clustering
    2.1.1. Basic Concept
    2.1.2. Data Representation
    2.1.3. Data Preprocessing
    2.1.4. Validation Methods
  2.2. Text Document Clustering Methods
  2.3. Meta-Heuristic Methods
    2.3.1. Particle Swarm Optimization (PSO) Algorithm
    2.3.2. Genetics Algorithm
  2.4. Meta-heuristics in Text Document Clustering
Chapter 3 METHODOLOGY
  3.1. Document Clustering with Vector Space Modification
  3.2. Document Clustering with Meta-heuristic and Vector Space Modification
    3.2.1. Particle Swarm Optimization based Modified Vector Space
    3.2.2. Genetic Algorithm based Modified Vector Space
Chapter 4 COMPUTATIONAL RESULTS AND ANALYSIS
  4.1. Parameters Setup
  4.2. Computational Results
  4.3. Statistical Test
  4.4. Algorithm Convergence
Chapter 5 CONCLUSIONS AND FUTURE RESEARCH
  5.1. Conclusion
  5.2. Contributions
  5.3. Future Research
REFERENCE
Appendix I COMPUTATIONAL RESULT
Appendix II STATISTICAL RESULT OF PROPOSED ALGORITHMS

