Author: 郭承諭 Chen-Yu Kuo
Thesis Title: 結合K-Prototypes分群演算法與改良式正弦餘弦演算法於混合型資料分類之研究 (A Hybrid K-Prototypes Clustering Approach with Improved Sine-Cosine Algorithm for Mixed Data Classification)
Advisor: 王孔政 Kung-Jeng Wang
Committee: 歐陽超 Chao Ou-Yang, 陳怡永 Yi-Yung Chen
Degree: 碩士 Master
Department: 管理學院 工業管理系 Department of Industrial Management, College of Management
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: English
Pages: 65
Keywords (in Chinese): 分類、分群、混合型資料、正弦餘弦演算法、變異機制
Keywords (in English): Classification, Clustering, Mixed data, Sine-cosine algorithm, Mutation
Reference times: Clicks: 830, Downloads: 0
Abstract: When classifying mixed data containing both numerical and categorical attributes, existing supervised learning algorithms do not perform well, whereas unsupervised methods such as the k-prototypes algorithm show strong potential for handling mixed data. To combine the merits of clustering and classification, this study develops a novel clustering-based classification algorithm dedicated to mixed data. The proposed method applies the sine-cosine algorithm (SCA) to find the optimal weight of each attribute and the optimal initial centroids for k-prototypes clustering; the SCA objective function is the purity summed over all clusters. To keep the SCA from being trapped in local optima, a mutation mechanism comprising Gaussian, Cauchy, Levy, and single-point mutation is added to the algorithm. Using benchmark data sets from the UCI Machine Learning Repository, the proposed method is compared with popular classifiers (back-propagation neural network, support vector machine, decision tree, stochastic gradient descent, k-nearest neighbors, and Gaussian naive Bayes) and with other metaheuristics for finding the initial centroids and attribute weights of k-prototypes (the genetic algorithm and the original SCA). The experimental results, evaluated by accuracy, F-measure, and Cohen's kappa, confirm that the proposed method achieves superior classification performance on mixed data sets.
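The building blocks named in the abstract can be sketched in a few lines of Python. This is an illustrative sketch based on the standard formulations of the k-prototypes dissimilarity (Huang, 1997) and the SCA position update (Mirjalili, 2016), not the thesis's exact implementation; the `weights`, `sigma`, and `beta` parameters and the uniform choice among the four mutation operators are assumptions for illustration.

```python
import math
import random

def mixed_distance(x, prototype, num_idx, cat_idx, weights, gamma=1.0):
    """Weighted k-prototypes-style dissimilarity: squared Euclidean distance on
    numeric attributes plus a gamma-scaled mismatch count on categorical ones."""
    d_num = sum(weights[j] * (x[j] - prototype[j]) ** 2 for j in num_idx)
    d_cat = sum(weights[j] * (x[j] != prototype[j]) for j in cat_idx)
    return d_num + gamma * d_cat

def sca_update(x, best, t, T, a=2.0):
    """One sine-cosine position update toward the best-known solution."""
    r1 = a - t * (a / T)  # decreases linearly: exploration -> exploitation
    out = []
    for xi, pi in zip(x, best):
        r2 = random.uniform(0.0, 2.0 * math.pi)
        r3 = random.uniform(0.0, 2.0)
        trig = math.sin(r2) if random.random() < 0.5 else math.cos(r2)
        out.append(xi + r1 * trig * abs(r3 * pi - xi))
    return out

def mutate(x, sigma=0.1, beta=1.5):
    """Mutation pool for escaping local optima: Gaussian, Cauchy, Levy
    (via Mantegna's algorithm), or single-point, chosen uniformly at random."""
    kind = random.choice(["gaussian", "cauchy", "levy", "single"])
    if kind == "gaussian":
        return [xi + random.gauss(0.0, sigma) for xi in x]
    if kind == "cauchy":  # heavy-tailed Cauchy step
        return [xi + sigma * math.tan(math.pi * (random.random() - 0.5)) for xi in x]
    if kind == "levy":    # Levy-stable step, Mantegna's algorithm
        su = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
              / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
        return [xi + sigma * random.gauss(0.0, su)
                / abs(random.gauss(0.0, 1.0)) ** (1 / beta) for xi in x]
    y = list(x)           # single-point: perturb one random coordinate
    y[random.randrange(len(y))] += random.gauss(0.0, sigma)
    return y
```

In the hybrid scheme, each SCA solution would encode candidate attribute weights and initial centroids, be scored by the summed cluster purity after running k-prototypes with `mixed_distance`, and be moved by `sca_update` with `mutate` applied to maintain diversity.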