Graduate student: | Yardin Heidsyam |
---|---|
Thesis title: | 循序型資料分群及分類法整合架構於資料分析之研究 A Framework of Sequential Data Clustering and Classification for Data Analysis |
Advisor: | 楊朝龍 Chao-Lung Yang |
Oral defense committee: | 歐陽超 Chao Ou-Yang; 喻奉天 Vincent F. Yu |
Degree: | Master |
Department: | College of Management - Department of Industrial Management |
Year of publication: | 2013 |
Academic year of graduation: | 101 |
Language: | English |
Number of pages: | 63 |
Chinese keywords: | sequential data clustering and classification, hierarchical clustering, decision tree analysis |
English keywords: | Sequential Clustering and Classification, Hierarchical Clustering, Decision Tree |
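Traditional statistical methods such as multivariate analysis of variance (MANOVA) and canonical analysis have been widely applied to data analysis, but their assumption of a linear model limits them when analyzing complex data. The goal of this research is to develop a framework that integrates data clustering and data classification techniques to analyze complex data effectively. The framework combines the characteristics of clustering and classification and applies them to different data: quantitative measurement data (the Q dataset) and categorical data (the X dataset). The clustering method is first applied to the Q dataset to group the quantitative data. The clustering results (that is, the cluster labels) are then combined with the X dataset and used as the input to the classification method. The classification results are used to analyze the association between the categorical data and the clustering results. In this research, hierarchical clustering and the decision tree method Classification and Regression Tree (CART) serve as the representative clustering and classification methods, chosen for the readability of their tree structures and the stability of their results. To examine the performance of clustering and classification at the same time, and so choose a number of clusters that balances the two, we propose a visualization, the Clustering Classification Evaluation (CCE) plot, for determining the number of clusters. In the plot, clustering results are represented by the complementary variance and classification results by classification accuracy. The proposed method was applied to several public datasets, and the results show that the CCE plot gives data analysts a basis for choosing the number of clusters. Applying clustering and classification together also frees the analysis from assumptions about a specific data distribution or linearity.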
Because of their model assumptions, traditional statistical methods such as multivariate analysis of variance (MANOVA) and Canonical Correlation Analysis (CCA) are limited in their ability to analyze complicated real-world datasets. Data mining techniques such as clustering and classification algorithms are promising for analyzing multiple-attribute datasets and revealing their structure. This research proposes a framework that integrates clustering and classification, applied to different datasets: numerical measures (Q dataset) and categorical features (X dataset), respectively. The clustering method is expected to help rapidly analyze or identify the numerical measures (Q dataset). The clustering results (the cluster labels) are then combined with the X dataset as the input to the classification model, which predicts the cluster labels from the X dataset. Hierarchical clustering and Classification and Regression Tree (CART) are used to represent the clustering and classification methods, respectively, because of their tree-structure characteristics. To keep the performance of clustering and classification balanced, the Clustering Classification Evaluation (CCE) plot is proposed to show the performance measures of both results together. Clustering quality is measured by the complementary sum of squared errors (SSE_com), and classification performance is measured by prediction accuracy. Several real-life datasets are used to evaluate the proposed framework. The results show that CCE plots can be used to determine the number of clusters, an important parameter affecting the performance of the proposed framework.
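The two-stage procedure described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: it uses scikit-learn's `AgglomerativeClustering` (Ward linkage) for the hierarchical clustering and `DecisionTreeClassifier` as a CART-style tree; it assumes the complementary SSE is `1 - SSE_k / SSE_total`, since the abstract does not give the formula; and it estimates accuracy by 5-fold cross-validation, which is also an assumption. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def within_sse(Q, labels):
    """Within-cluster sum of squared errors for a given labeling of Q."""
    return float(sum(
        ((Q[labels == c] - Q[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    ))


def cce_table(Q, X, k_values, cv=5, seed=0):
    """Return (k, sse_com, accuracy) rows, one per candidate cluster count."""
    total_ss = float(((Q - Q.mean(axis=0)) ** 2).sum())
    rows = []
    for k in k_values:
        # Step 1: hierarchical (Ward) clustering on the numerical Q dataset.
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(Q)
        # ASSUMPTION: complementary SSE is taken as 1 - SSE_k / SSE_total.
        sse_com = 1.0 - within_sse(Q, labels) / total_ss
        # Step 2: a CART-style tree predicts the cluster labels from X.
        tree = DecisionTreeClassifier(random_state=seed)
        accuracy = float(cross_val_score(tree, X, labels, cv=cv).mean())
        rows.append((k, sse_com, accuracy))
    return rows


def make_demo_data(n_per_group=40, seed=0):
    """Hypothetical synthetic data: three well-separated groups in Q."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    group = np.repeat(np.arange(3), n_per_group)
    Q = centers[group] + rng.normal(scale=0.5, size=(group.size, 2))
    # X: one informative categorical column (integer-coded) and one noise column.
    X = np.column_stack([group, rng.integers(0, 2, size=group.size)])
    return Q, X


if __name__ == "__main__":
    Q, X = make_demo_data()
    for k, sse_com, acc in cce_table(Q, X, k_values=range(2, 7)):
        print(f"k={k}  SSE_com={sse_com:.3f}  accuracy={acc:.3f}")
```

Plotting the two columns against k on shared axes gives a CCE-style view: the clustering measure rises as k grows while classification accuracy tends to fall once clusters are split more finely than X can explain, so the analyst looks for a k where both remain acceptable. Real categorical features would normally be one-hot encoded before being passed to the tree; the integer codes here are a simplification.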