
Author: 郭承諭 (Chen-Yu Kuo)
Thesis title: 結合K-Prototypes分群演算法與改良式正弦餘弦演算法於混合型資料分類之研究
(A Hybrid K-Prototypes Clustering Approach with Improved Sine-Cosine Algorithm for Mixed Data Classification)
Advisor: 王孔政 (Kung-Jeng Wang)
Committee members: 歐陽超 (Chao Ou-Yang), 陳怡永 (Yi-Yung Chen)
Degree: Master
Department: College of Management - Department of Industrial Management
Year of publication: 2019
Graduation academic year: 107 (ROC calendar, i.e., 2018-2019)
Language: English
Pages: 65
Keywords: Classification, Clustering, Mixed data, Sine-cosine algorithm, Mutation

When solving classification problems on mixed data (containing both numerical and categorical attributes), existing supervised learning algorithms do not perform perfectly, whereas unsupervised learning methods such as the k-prototypes algorithm show excellent potential for handling mixed data. To combine the advantages of clustering and classification, this study develops a novel clustering-based classification algorithm dedicated to mixed data. The proposed method applies the sine-cosine algorithm to find the optimal weight of each attribute and the optimal initial centroids for k-prototypes. The objective function of the sine-cosine algorithm is the accumulated purity over all clusters. To keep the sine-cosine algorithm from falling into local optima, this study adds a mutation mechanism to the algorithm, comprising Gaussian, Cauchy, Lévy, and single-point mutation. Using data sets from the UCI Machine Learning Repository, the proposed method is compared with common classification algorithms (back-propagation neural network, support vector machine, decision tree, stochastic gradient descent, k-nearest neighbors, and Gaussian naive Bayes) and with different metaheuristics (genetic algorithm and the original sine-cosine algorithm) in terms of accuracy, F-measure, and Cohen's Kappa. The results confirm that the proposed method achieves superior classification performance on mixed data sets.
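The four mutation operators named in the abstract can be illustrated with a minimal Python sketch. The `mutate` helper, the `scale` step size, and the Lévy exponent beta = 1.5 (drawn via Mantegna's algorithm) are illustrative assumptions, not the thesis's actual implementation:

```python
import math
import random

def mutate(position, op, scale=0.1):
    """Apply one of four mutation operators to a candidate solution
    (a list of floats). Operator names follow the abstract; the scale
    parameter and the Levy exponent are illustrative assumptions."""
    x = list(position)
    if op == "gaussian":       # perturb every dimension with N(0, scale)
        return [v + random.gauss(0, scale) for v in x]
    if op == "cauchy":         # heavier tails allow occasional large jumps
        return [v + scale * math.tan(math.pi * (random.random() - 0.5)) for v in x]
    if op == "levy":           # Levy-type step via Mantegna's algorithm
        beta = 1.5
        sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
                 / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
        return [v + scale * random.gauss(0, sigma) / abs(random.gauss(0, 1)) ** (1 / beta)
                for v in x]
    if op == "single_point":   # re-draw one randomly chosen dimension
        i = random.randrange(len(x))
        x[i] = random.uniform(0.0, 1.0)
        return x
    raise ValueError(f"unknown operator: {op}")
```

A strategy like this keeps the population diverse: the Gaussian and single-point operators make local adjustments, while the Cauchy and Lévy operators occasionally jump far enough to escape a local optimum.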


When dealing with mixed data containing both numerical and categorical attributes in a classification problem, supervised learning algorithms do not perform well, while some unsupervised algorithms, such as the k-prototypes algorithm, show their potential in clustering such data. Thus, this study aims to develop a novel clustering-based classification algorithm for mixed data, so as to combine the merits of classification and clustering. The proposed algorithm employs a sine-cosine algorithm (SCA) to find attribute weights and initial centroids for the k-prototypes algorithm. The objective function of the SCA is formulated as the purity summed over all clusters. To reduce the risk of the SCA being trapped in a local optimum, a mutation strategy comprising Gaussian, Cauchy, Lévy, and single-point mutation is added to the original SCA. The proposed algorithm is compared with popular classification algorithms (back-propagation neural network, support vector machine, decision tree, stochastic gradient descent, k-nearest neighbors, and Gaussian naive Bayes) and with different metaheuristics for finding the optimal initial centers and attribute weights in the k-prototypes algorithm (genetic algorithm and the original sine-cosine algorithm), using benchmark data sets from the UCI repository. The experimental results show that the proposed algorithm achieves superior classification performance in terms of accuracy, F-measure, and Cohen's Kappa.
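A minimal Python sketch of the two ingredients the SCA optimizes, a per-attribute weighted k-prototypes dissimilarity and the summed cluster-purity objective, is given below. The function names, the `gamma` balancing factor, and the exact weighting scheme are illustrative assumptions; the thesis's formulation may differ:

```python
from collections import Counter

def kproto_distance(x, c, weights, num_idx, cat_idx, gamma=1.0):
    """Weighted k-prototypes dissimilarity between record x and centroid c:
    squared Euclidean distance over numeric attributes plus a gamma-scaled
    mismatch count over categorical attributes. The per-attribute weights
    (the quantities the SCA searches for) are assumptions here."""
    num = sum(weights[i] * (x[i] - c[i]) ** 2 for i in num_idx)
    cat = sum(weights[i] * (x[i] != c[i]) for i in cat_idx)
    return num + gamma * cat

def total_purity(clusters, labels):
    """Fitness in the style of the abstract's objective: the sum over
    clusters of the fraction of members that share the cluster's
    majority class label (so a perfect k-cluster partition scores k)."""
    score = 0.0
    for members in clusters:            # members: list of record indices
        if not members:
            continue
        counts = Counter(labels[i] for i in members)
        score += counts.most_common(1)[0][1] / len(members)
    return score
```

With these pieces, each SCA candidate would encode one weight per attribute plus the initial centroids; records are assigned to their nearest centroid under `kproto_distance`, and `total_purity` of the resulting partition is the fitness to maximize.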

摘要 (Chinese Abstract)
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Background and motivation
  1.2 Research objectives
  1.3 Research scope and constraints
  1.4 Organization of the thesis
CHAPTER 2 LITERATURE REVIEW
  2.1 Data mining
  2.2 Clustering (unsupervised learning)
  2.3 Classification (supervised learning)
  2.4 Clustering approach-based classification
  2.5 Genetic algorithm
  2.6 Sine-cosine algorithm
CHAPTER 3 METHODOLOGY
  3.1 Notations
  3.2 Algorithm design
CHAPTER 4 EXPERIMENTAL RESULTS
  4.1 Benchmark algorithms and parameter settings
  4.2 Data description
  4.3 Accuracy
  4.4 Cohen's Kappa
  4.5 Confusion matrix
  4.6 Computational time
  4.7 Statistical testing
CHAPTER 5 CASE STUDY
CHAPTER 6 CONCLUSIONS
  6.1 Conclusions
  6.2 Contributions
  6.3 Future research
REFERENCES
APPENDIX

Ashraf, M., Zaman, M., & Ahmed, M. (2019). To ameliorate classification accuracy using ensemble vote approach and base classifiers. In Emerging Technologies in Data Mining and Information Security (pp. 321-334). Springer, Singapore.
Askarzadeh, A. (2018). A memory-based genetic algorithm for optimization of power generation in a microgrid. IEEE Transactions on Sustainable Energy, 9(3), 1081-1089.
Audigier, V., Husson, F., & Josse, J. (2016). A principal component method to impute missing values for mixed data. Advances in Data Analysis and Classification, 10(1), 5-26.
Behzadi, S., Müller, N. S., Plant, C., & Böhm, C. (2019, April). Clustering of mixed-type data considering concept hierarchies. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 555-573). Springer, Cham.
Chiu, C., Chi, H., Sung, R., & Yuang, J. Y. (2010, November). The hybrid of genetic algorithms and k-prototypes clustering approach for classification. In 2010 International Conference on Technologies and Applications of Artificial Intelligence (pp. 327-330). IEEE.
Cunningham, P., Cord, M., & Delany, S. J. (2008). Supervised learning. In Machine Learning Techniques for Multimedia (pp. 21-49). Springer, Berlin, Heidelberg.
Darwin, C. (1872). The Origin of Species: By Means of Natural Selection, or the Preservation of Favored Races in the Struggle for Life (Vol. 1). Modern Library.
Das, S., Bhattacharya, A., & Chakraborty, A. K. (2018). Solution of short-term hydrothermal scheduling using sine cosine algorithm. Soft Computing, 22(19), 6409-6427.
De Jong, K. A. (1975). Analysis of the behavior of a class of genetic adaptive systems. Ph.D. thesis, University of Michigan, Ann Arbor, MI. Dissertation Abstracts International 36(10), 5140B, University Microfilms Number 76-9381.
Elaziz, M. A., Oliva, D., & Xiong, S. (2017). An improved opposition-based sine cosine algorithm for global optimization. Expert Systems with Applications, 90, 484-500.
Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In Proceedings of the Sixteenth International Conference on Machine Learning (pp. 124-133).
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
Gong, D., Sun, J., & Miao, Z. (2018). A set-based genetic algorithm for interval many-objective optimization problems. IEEE Transactions on Evolutionary Computation, 22(1), 47-60.
Gorzałczany, M. B., & Rudziński, F. (2018). Generalized self-organizing maps for automatic determination of the number of clusters and their multiprototypes in cluster analysis. IEEE Transactions on Neural Networks and Learning Systems, 29(7), 2833-2845.
Guha, S., Rastogi, R., & Shim, K. (1998, June). CURE: an efficient clustering algorithm for large databases. In ACM Sigmod Record (Vol. 27, No. 2, pp. 73-84). ACM.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and Techniques. Elsevier.
Holland, J. (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Ann Arbor, MI: University of Michigan Press.
Huang, Z. (1997, February). Clustering large data sets with mixed numeric and categorical values. In Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining, (PAKDD) (pp. 21-34).
Iwamatsu, M. (2002). Generalized evolutionary programming with Lévy-type mutation. Computer Physics Communications, 147(1-2), 729-732.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data (Vol. 6). Englewood Cliffs, NJ: Prentice Hall.
Ji, J., Bai, T., Zhou, C., Ma, C., & Wang, Z. (2013). An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing, 120, 590-596.
Ji, M., Tang, H., & Guo, J. (2004). A single-point mutation evolutionary programming. Information Processing Letters, 90(6), 293-299.
Jia, H., & Cheung, Y. M. (2018). Subspace clustering of categorical and numerical data with an unknown number of clusters. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3308-3325.
Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. 202-207). Portland, OR: AAAI Press.
Lam, D., Wei, M., & Wunsch, D. (2015). Clustering data of mixed categorical and numerical type with unsupervised feature learning. IEEE Access, 3, 1605-1613.
Landwehr, N., Hall, M., & Frank, E. (2005). Logistic model trees. Machine Learning, 59(1-2), 161-205.
Li, C., & Biswas, G. (2002). Unsupervised learning with mixed numeric and nominal data. IEEE Transactions on Knowledge and Data Engineering, 14(4), 673-690.
Li, S., Fang, H., & Liu, X. (2018). Parameter optimization of support vector regression based on sine cosine algorithm. Expert Systems with Applications, 91, 63-77.
Lv, Z., Wang, L., Guan, Z., Wu, J., Du, X., Zhao, H., & Guizani, M. (2019). An optimizing and differentially private clustering algorithm for mixed data in SDN-based smart grid. IEEE Access, 7, 45773-45782.
Mirjalili, S. (2016). SCA: a sine cosine algorithm for solving optimization problems. Knowledge-Based Systems, 96, 120-133.
Nenavath, H., & Jatoth, R. K. (2018). Hybridizing sine cosine algorithm with differential evolution for global optimization and object tracking. Applied Soft Computing, 62, 1019-1043.
Özbakır, L., & Turna, F. (2017). Clustering performance comparison of new generation meta-heuristic algorithms. Knowledge-Based Systems, 130, 1-16.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Rahman, M. A., & Islam, M. Z. (2014). A hybrid clustering technique combining a novel genetic algorithm with K-Means. Knowledge-Based Systems, 71, 345-365.
Reddy, K. S., Panwar, L. K., Panigrahi, B. K., & Kumar, R. (2018). A New Binary Variant of Sine–Cosine Algorithm: Development and Application to Solve Profit-Based Unit Commitment Problem. Arabian Journal for Science and Engineering, 43(8), 4041-4056.
Sindhu, R., Ngadiran, R., Yacob, Y. M., Zahri, N. A. H., & Hariharan, M. (2017). Sine–cosine algorithm for feature selection with elitism strategy and new updating mechanism. Neural Computing and Applications, 28(10), 2947-2958.
UCI Machine Learning Repository (2019). https://archive.ics.uci.edu/ml/datasets.php
Vapnik, V. (1998). The support vector method of function estimation. In Nonlinear Modeling (pp. 55-85). Springer, Boston, MA.
Wang, W., Li, Q., Han, S., & Lin, H. (2006, August). A preliminary study on constructing decision tree with gene expression programming. In First International Conference on Innovative Computing, Information and Control-Volume I (ICICIC'06) (Vol. 1, pp. 222-225). IEEE.
Yao, X., Liu, Y., & Lin, G. (1999). Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 3(2), 82-102.

Full-text release date: 2024/06/21 (campus network)
Full text not authorized for public release (off-campus network)
Full text not authorized for public release (National Central Library: Taiwan NDLTD system)