Graduate student: | 蘇柏瑜 PO-YU SU |
---|---|
Thesis title: | 整合分群分析與粒化運算以處理資料不平衡之分類問題─以攝護腺癌症預後為例 Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem─A Case Study on Prostate Cancer Prognosis |
Advisor: | 郭人介 Ren-Jieh Kuo |
Oral examination committee: | 歐陽超 Chao Ou-Yang, 蔡介元 Chieh-Yuan Tsai |
Degree: | Master |
Department: | School of Management - Department of Industrial Management |
Year of publication: | 2015 |
Academic year of graduation: | 103 |
Language: | English |
Number of pages: | 150 |
Chinese keywords: | 預後、攝護腺癌、粒化運算、粒子群最佳化動態分群法、基因演算法為基礎之K平均數分群法、人工蜂群演算法為基礎之K平均數分群法、類別不平衡、分類 |
English keywords: | Prognosis, Prostate cancer, Granular computing, Dynamic clustering using particle swarm optimization, Genetic algorithm K-means, Artificial bee colony K-means, Class imbalance, Classification |
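This study applies the concept of information granulation (IG) to the classification of imbalanced data. Majority-class records with similar properties are clustered into granules, which balances the proportion of each class in the data set and keeps the critical minority records from being diluted by the much larger majority class. This preprocessing step allows classification algorithms to produce better results on imbalanced data.
Information granules are constructed with three clustering methods: dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm-based K-means (GA K-means), and artificial bee colony-based K-means (ABC K-means). The study therefore proposes three granular computing (GrC) models for the data imbalance problem and combines them with three classifiers, back-propagation neural network (BPN), decision tree (DT), and support vector machine (SVM), to build classification models. The proposed granulation models were validated on benchmark data sets from the UCI repository and preprocessed imbalanced data effectively in every case. Real data on the survival time of prostate cancer patients were then used to build a prognosis system, and its classification results also improved considerably.
The results show that the proposed GrC models reduce the difficulty of classifying imbalanced data and significantly raise the classification accuracy of the minority classes as well as most of the overall classification accuracies. Effective analysis of prostate cancer prognosis can also give doctors more accurate information to help prostate cancer patients, supporting better judgments about survival from limited pathological data.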
This study addresses the class imbalance problem using the concept of information granulation (IG). Data from the majority classes are assembled into granules to balance the class ratio within the data set. This process reduces the risk that critical information is diluted by large amounts of relatively unimportant data and noise.
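The granulation idea can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: plain Lloyd's k-means stands in for the clustering step (it is not one of the optimization-based clustering methods of this study), each granule is represented by its cluster centroid, and the data, the function names `kmeans` and `granulate_majority`, and the choice of five granules are all hypothetical.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (a stand-in clustering step); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def granulate_majority(X, y, majority_label, n_granules, seed=0):
    """Replace majority-class rows with n_granules centroids (assumed granule form)."""
    majority = [x for x, label in zip(X, y) if label == majority_label]
    rest = [(x, label) for x, label in zip(X, y) if label != majority_label]
    granules = kmeans(majority, n_granules, seed=seed)
    X_bal = [list(x) for x, _ in rest] + granules
    y_bal = [label for _, label in rest] + [majority_label] * len(granules)
    return X_bal, y_bal


# Hypothetical toy data: 40 majority rows (label 0) and 5 minority rows (label 1).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(5)]
y = [0] * 40 + [1] * 5

X_bal, y_bal = granulate_majority(X, y, majority_label=0, n_granules=5)
print(y_bal.count(0), y_bal.count(1))  # 5 5
```

Setting the number of granules equal to the minority count gives a 1:1 class ratio in this toy case; how many granules to form in practice is a modelling choice that this sketch does not settle.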
Three clustering techniques, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), and artificial bee colony K-means (ABC K-means), are implemented to construct information granules. Three granular computing (GrC) models are thus proposed in this study to solve the class imbalance problem. At the end of the procedure, classifiers are applied to construct a classification model for each data set. The effectiveness of the proposed GrC models has been evaluated on benchmark data sets from the UCI Machine Learning Repository. Since the proposed models produced solid classification results, real-world data on the survival length of patients with prostate cancer were used to construct a prognosis system. The classification results are also very promising.
The results indicate that the proposed GrC models are capable of reducing the difficulty of classifying imbalanced data. Furthermore, the proposed GrC models help raise the accuracies of the minority classes and most of the overall accuracies. The computational results for prostate cancer prognosis give doctors better information and analysis of the survival conditions of patients with prostate cancer.
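Why minority-class accuracy is reported alongside overall accuracy can be seen in a small worked example. The labels and the always-predict-majority classifier below are hypothetical and serve only to illustrate the two measures; they are not results from this study.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class that is predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical imbalanced labels: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print(overall_accuracy(y_true, y_pred))    # 0.95
print(per_class_accuracy(y_true, y_pred))  # {0: 1.0, 1: 0.0}
```

The overall accuracy looks high even though no minority sample is recognized, which is why an improvement in minority accuracy is the more telling measure for imbalanced data.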