
Graduate Student: 蘇柏瑜 (Po-Yu Su)
Thesis Title: 整合分群分析與粒化運算以處理資料不平衡之分類問題─以攝護腺癌症預後為例
Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem─A Case Study on Prostate Cancer Prognosis
Advisor: 郭人介 (Ren-Jieh Kuo)
Committee Members: 歐陽超 (Chao Ou-Yang), 蔡介元 (Chieh-Yuan Tsai)
Degree: Master
Department: Department of Industrial Management (管理學院 - 工業管理系)
Year of Publication: 2015
Graduation Academic Year: 103 (2014-2015)
Language: English
Number of Pages: 150
Keywords: Prognosis, prostate cancer, granular computing, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), artificial bee colony K-means (ABC K-means), class imbalance, classification
    This study applies the concept of Information Granulation (IG) to the classification of imbalanced data. Similar majority-class samples are clustered into granules, which balances the class ratio within the data set and keeps the critical minority samples from being diluted by the large amount of majority-class data. This preprocessing step allows classification algorithms to achieve better results on imbalanced data.
    Three clustering methods are used to construct the information granules: dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm-based K-means (GA K-means), and artificial bee colony-based K-means (ABC K-means). Accordingly, three granular computing (GrC) models are proposed to resolve the class imbalance problem, and they are combined with three classification methods, back-propagation neural network (BPN), decision tree (DT), and support vector machine (SVM), to build the classification models. The proposed granulation models are validated on benchmark data sets from the UCI repository and preprocess imbalanced data effectively; real data on the survival length of prostate cancer patients are therefore used to build a prognosis system, and the classification results improve considerably.
    The results show that the proposed GrC models reduce the difficulty of classifying imbalanced data while significantly improving the classification accuracy of the minority class and most of the overall accuracies. Effective prostate cancer prognosis analysis also gives physicians more accurate information, helping them make better judgments about patients' survival conditions from limited pathological data.
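
    As a rough illustration of the granulation step described above, the following minimal sketch clusters the majority class and replaces it with granule centroids before classification. It assumes NumPy and scikit-learn are available, uses plain K-means as a stand-in for the GA K-means, ABC K-means, and DCPSO variants studied in the thesis, and all function names and parameters are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

def granulate_majority(X, y, majority_label, n_granules):
    """Cluster the majority class and represent it by granule centroids."""
    is_major = (y == majority_label)
    X_major, X_minor, y_minor = X[is_major], X[~is_major], y[~is_major]

    # Plain K-means stands in here for the metaheuristic clustering methods.
    km = KMeans(n_clusters=n_granules, n_init=10, random_state=0).fit(X_major)
    granules = km.cluster_centers_  # each granule is summarized by its centroid

    # Rebuild a roughly balanced training set: granules plus original minority data.
    X_bal = np.vstack([granules, X_minor])
    y_bal = np.concatenate([np.full(n_granules, majority_label), y_minor])
    return X_bal, y_bal

    Choosing n_granules close to the number of minority samples yields an approximately balanced class ratio, which is the effect of the preprocessing described above.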


    This study addresses the class imbalance problem using the concept of Information Granulation (IG). Majority-class data are assembled into granules to balance the class ratio within the data set. This process reduces the risk of critical information being diluted by large amounts of relatively unimportant data and noise.
    Three clustering techniques, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), and artificial bee colony K-means (ABC K-means), are implemented to construct the information granules. Accordingly, three granular computing (GrC) models are proposed to solve the class imbalance problem. At the end of the procedure, classifiers are applied to build a classification model for each data set. The effectiveness of the proposed GrC models is evaluated on benchmark data sets from the UCI Machine Learning Repository.
    Because the proposed models produce solid classification results, real-world data on the survival length of prostate cancer patients are then used to construct a prognosis system, and the classification results are also very promising. The results indicate that the proposed GrC models reduce the difficulty of classifying imbalanced data and raise the minority-class accuracies as well as most of the overall accuracies. The computational results of prostate cancer prognosis give doctors better information and analysis of patients' survival conditions.
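
    To make the evaluation described above concrete, the sketch below reuses the hypothetical granulate_majority helper from the earlier sketch: it granulates only the training split, fits an SVM, and reports the overall accuracy together with the minority-class recall that the study emphasizes. The split size and SVM parameters are illustrative examples, not the settings used in the thesis.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

def evaluate_granulated(X, y, majority_label, n_granules):
    # Hold out a test set first so granulation only touches the training data.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    X_bal, y_bal = granulate_majority(X_tr, y_tr, majority_label, n_granules)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_bal, y_bal)
    y_pred = clf.predict(X_te)

    overall_acc = accuracy_score(y_te, y_pred)
    # Minority-class accuracy (recall), assuming a binary problem for simplicity.
    minority_label = [c for c in np.unique(y) if c != majority_label][0]
    minority_acc = recall_score(y_te, y_pred, pos_label=minority_label)
    return overall_acc, minority_acc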

ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
    1.1 Research Background
    1.2 Research Objectives
    1.3 Research Scopes and Constraints
    1.4 Framework and Organization
CHAPTER 2 LITERATURE SURVEY
    2.1 Prostate Cancer
        2.1.1 Data Mining in Classification of Prostate Cancer
        2.1.2 Critical Factors of Prostate Cancer
    2.2 Classification
        2.2.1 Decision Tree
        2.2.2 Artificial Neural Network
        2.2.3 Support Vector Machine
    2.3 Class Imbalance Problems
    2.4 Cluster Analysis
        2.4.1 Clustering Techniques
        2.4.2 K-means Algorithm
        2.4.3 Meta-heuristic-based K-means Clustering Methods
        2.4.4 Automatic Clustering Methods
    2.5 Granular Computing
CHAPTER 3 RESEARCH METHODOLOGY
    3.1 Research Framework
    3.2 Construction of Information Granules
        3.2.1 Apply GA K-means in IG Process
        3.2.2 Apply ABC K-means in IG Process
        3.2.3 Apply DCPSO in IG Process
    3.3 Selection of Granularity
    3.4 Representation of Information Granules
    3.5 Latent Semantic Indexing
    3.6 Classification
        3.6.1 Back-propagation Neural Network
        3.6.2 Decision Tree
        3.6.3 Support Vector Machine
CHAPTER 4 EXPERIMENTAL RESULTS
    4.1 Balanced Benchmark Results and Analysis
        4.1.1 Overall Computational Results of Original Data
        4.1.2 Computational Results of BPN
        4.1.3 Computational Results of C5.0
        4.1.4 Computational Results of SVM
    4.2 Imbalanced Benchmark Results and Analysis
        4.2.1 Overall Computational Results of Original Data
        4.2.2 Computational Results of BPN
        4.2.3 Computational Results of C5.0
        4.2.4 Computational Results of SVM
    4.3 Statistical Hypothesis
CHAPTER 5 MODEL EVALUATION RESULTS
    5.1 Data Collection
    5.2 Factor Selection - Stepwise Regression
    5.3 Prognosis Trial
        5.3.1 Computational Results of Original Data
        5.3.2 Computational Results of Proposed Models
    5.4 Statistical Hypothesis
CHAPTER 6 CONCLUSION AND FUTURE RESEARCH
    6.1 Conclusion
    6.2 Contributions
    6.3 Future Research
REFERENCE
APPENDIX
    Appendix I - BPN Computational Results of Balanced Benchmark (Original Data)
    Appendix II - C5.0 Computational Results of Balanced Benchmark (Original Data)
    Appendix III - SVM Computational Results of Balanced Benchmark (Original Data)
    Appendix IV - BPN Computational Results of Imbalanced Benchmark (Original Data)
    Appendix V - C5.0 Computational Results of Imbalanced Benchmark (Original Data)
    Appendix VI - SVM Computational Results of Imbalanced Benchmark (Original Data)
    Appendix VII - BPN Computational Results of Balanced Benchmark (Proposed Models)
    Appendix VIII - C5.0 Computational Results of Balanced Benchmark (Proposed Models)
    Appendix IX - SVM Computational Results of Balanced Benchmark (Proposed Models)
    Appendix X - BPN Computational Results of Imbalanced Benchmark (Proposed Models)
    Appendix XI - C5.0 Computational Results of Imbalanced Benchmark (Proposed Models)
    Appendix XII - SVM Computational Results of Imbalanced Benchmark (Proposed Models)
    Appendix XIII - The Result of BSWD and Glass using BPN
    Appendix XIV - The Result of Car Evaluation and Pima using BPN
    Appendix XV - The Result of BSWD and Glass using C5.0
    Appendix XVI - The Result of Car Evaluation and Pima using C5.0
    Appendix XVII - The Result of BSWD and Glass using SVM
    Appendix XVIII - The Result of Car Evaluation and Pima using SVM
    Appendix XIX - The Result of Prostate Cancer using BPN and C5.0
    Appendix XX - The Result of Prostate Cancer using SVM
    Appendix XXI - SVM Parameter Setting Grid Search Figures for Imbalanced Data
    Appendix XXII - SVM Parameter Setting Grid Search Figures for Prostate Cancer


    Full-text release date: 2020/06/24 (campus network)
    Full-text release date: Not authorized for public release (off-campus network)
    Full-text release date: Not authorized for public release (National Central Library: Taiwan NDLTD system)