Graduate student: Yardin Heidsyam
Thesis title: A Framework of Sequential Data Clustering and Classification for Data Analysis
Advisor: Chao-Lung Yang (楊朝龍)
Oral defense committee: Chao Ou-Yang (歐陽超), Vincent F. Yu (喻奉天)
Degree: Master
Department: College of Management - Department of Industrial Management
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 63
Keywords (Chinese, translated): sequential data clustering and classification, hierarchical clustering, decision tree analysis
Keywords (English): Sequential Clustering and Classification, Hierarchical Clustering, Decision Tree
Usage counts: 251 views, 3 downloads

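Traditional statistical methods such as multivariate analysis of variance (MANOVA) and canonical analysis have been widely applied to data analysis, but their linear-model assumptions limit them when analyzing complex data. The goal of this research is to develop a framework that integrates data clustering and data classification techniques to analyze complex data effectively. The framework combines the characteristics of clustering and classification and applies them to different data: quantitative measurement data (Q dataset) and categorical data (X dataset). The clustering method is first applied to the Q dataset to group the quantitative data. The clustering result (that is, the cluster labels of the data) is then combined with the X dataset as the input to the classification method. The classification result is used to analyze the association between the categorical data and the clustering result. In this research, hierarchical clustering and the decision tree method Classification and Regression Tree (CART) serve as the representative clustering and classification methods, chosen for the readability of their tree structures and the stability of their results. To examine the performance of clustering and classification at the same time, and so select a number of clusters that balances the two, we propose a visual presentation, the Clustering Classification Evaluation (CCE) plot, for determining the number of clusters. In it, the clustering result is presented as the complementary variance, while the classification method is compared by classification accuracy. The proposed method was applied to several public datasets, and the results show that the CCE plot can give data-analysis decision makers a basis for choosing the number of clusters. Applying clustering and classification together also frees the analysis from assumptions about a specific data distribution or linearity.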


Because of their model assumptions, traditional statistical methods such as multivariate analysis of variance (MANOVA) and Canonical Correlation Analysis (CCA) are limited in analyzing the complicated datasets found in the real world today. Data mining techniques such as clustering and classification algorithms are promising tools for exploring and analyzing multiple-attribute datasets. This research proposes a framework that integrates clustering and classification, applied to different datasets: numerical measures (Q dataset) and categorical features (X dataset), respectively. The clustering method is expected to help in rapidly analyzing or identifying the numerical measures (Q dataset). The clustering results, that is, the cluster labels, are then combined with the X dataset as the inputs to the classification model, which predicts the cluster labels from the X dataset. In this research, hierarchical clustering and Classification and Regression Tree (CART) are used to represent the clustering and classification methods, respectively, because of their tree-structure characteristics. To keep the performance of clustering and classification learning balanced, the Clustering Classification Evaluation (CCE) plot is proposed to show the performance measures of both the clustering and the classification results together. Here, clustering quality is measured by the complementary sum of squared errors (SSE_com), and classification performance is measured by prediction accuracy. Several real-life datasets are used to evaluate the proposed framework. The results show that CCE plots can be used to determine the number of clusters, an important parameter affecting the performance of the proposed framework.
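The sequential procedure described in the abstract (cluster the Q dataset, then predict the cluster labels from the X dataset, and compare both quality measures across candidate numbers of clusters) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the thesis's implementation: the record does not give the formula for SSE_com, so it is assumed here to be 1 - SSE_k / SSE_total; a small Gini-based tree on categorical features stands in for CART; accuracy is assumed to be measured on a simple hold-out split; and the toy data and every function name (`ward_labels`, `sse`, `grow`, `predict`, `cce_points`) are hypothetical.

```python
import random
from collections import Counter

def ward_labels(Q, k):
    """Naive agglomerative clustering using Ward's merge cost; one label per row."""
    clusters = [[i] for i in range(len(Q))]
    def centroid(c):
        return [sum(Q[i][d] for i in c) / len(c) for d in range(len(Q[0]))]
    def cost(a, b):
        # Increase in within-cluster SSE caused by merging clusters a and b.
        d2 = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
        return len(a) * len(b) / (len(a) + len(b)) * d2
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: cost(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i] = clusters[i] + merged
    labels = [0] * len(Q)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels

def sse(Q, labels):
    """Within-cluster sum of squared errors around each cluster centroid."""
    total = 0.0
    for lab in set(labels):
        rows = [q for q, l in zip(Q, labels) if l == lab]
        center = [sum(col) / len(rows) for col in zip(*rows)]
        total += sum((x - c) ** 2 for r in rows for x, c in zip(r, center))
    return total

def gini(ys):
    return 1.0 - sum((c / len(ys)) ** 2 for c in Counter(ys).values())

def grow(X, y, depth=0, max_depth=3):
    """Simplified CART-style tree: binary splits 'feature == value' chosen by Gini."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    best = None
    for f in range(len(X[0])):
        for v in sorted(set(row[f] for row in X)):
            left = [i for i, row in enumerate(X) if row[f] == v]
            right = [i for i, row in enumerate(X) if row[f] != v]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, v, left, right)
    if best is None:
        return majority
    _, f, v, left, right = best
    return (f, v,
            grow([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            grow([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree

def cce_points(Q, X, ks, train_frac=0.7, seed=0):
    """For each k, return (k, assumed SSE_com, hold-out accuracy)."""
    idx = list(range(len(Q)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    sse_total = sse(Q, [0] * len(Q))  # SSE with every row in a single cluster
    points = []
    for k in ks:
        labels = ward_labels(Q, k)                   # step 1: cluster Q
        sse_com = 1.0 - sse(Q, labels) / sse_total   # assumed definition
        tree = grow([X[i] for i in train], [labels[i] for i in train])  # step 2
        hits = sum(predict(tree, X[i]) == labels[i] for i in test)
        points.append((k, sse_com, hits / len(test)))
    return points

# Invented toy data: two numeric measures (Q) and two categorical features (X).
rng = random.Random(1)
Q, X = [], []
for _ in range(30):
    group = rng.choice(["small", "mid", "large"])
    base = {"small": 1.0, "mid": 5.0, "large": 9.0}[group]
    Q.append([base + rng.gauss(0, 0.3), base + rng.gauss(0, 0.3)])
    X.append([group, rng.choice(["a", "b"])])
for k, s, acc in cce_points(Q, X, ks=range(2, 6)):
    print(k, round(s, 3), round(acc, 3))
```

Plotting the second and third columns against k on shared axes would give a picture in the spirit of the CCE plot. Under the assumed definition, SSE_com can only rise as k grows, while accuracy tends to fall once the clusters become finer than the categorical features can distinguish, so the plot exposes the trade-off that the choice of k has to balance.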

ABSTRACT (CHINESE)
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE REVIEW
  2.1 Clustering
    2.1.1 Application of Clustering in Manufacturing
    2.1.2 Hierarchical Clustering Algorithm
    2.1.3 K Means Clustering Algorithm
    2.1.4 PSO-K Means Clustering Algorithm
    2.1.5 Assessing Clustering Quality
  2.2 Classification
    2.2.1 Application of classification in manufacturing
    2.2.2 Choose the Appropriate Classifier
    2.2.3 Decision Tree
    2.2.4 Estimation of decision tree quality
  2.3 Sequential Clustering and Classification
CHAPTER 3 RESEARCH METHODOLOGY
  3.1 Research Motivation
  3.2 Research Framework
  3.3 Clustering-Classification Evaluation Plot (CCE Plot)
CHAPTER 4 EXPERIMENTAL RESULTS
  4.1 Datasets
  4.2 Preliminary Research
  4.3 Results
    4.3.1 Auto MPG Dataset
    4.3.2 Automobile Dataset
    4.3.3 Credit Approval Dataset (Credit)
    4.3.4 Switzerland Heart Disease Dataset
CHAPTER 5 DISCUSSION AND CONCLUSIONS
REFERENCES
APPENDIX
  A. Preliminary Result for Comparing Clustering Indices and Accuracy Pattern on Auto MPG Dataset
  B. Preliminary Result for Comparing Clustering Indices and Accuracy Pattern on Auto MPG Dataset
  C. Auto MPG Experimental Results of Hierarchical Clustering with Ward’s Method and CART
  D. Automobile Experimental Results of Hierarchical Clustering with Ward’s Method and CART
  E. Credit Approval Experimental Results of Hierarchical Clustering with Ward’s Method and CART
  F. Switzerland Heart Disease Experimental Results of Hierarchical Clustering with Ward’s Method and CART

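Full-text release date: 2017/07/11 (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: Taiwan theses and dissertations system)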