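Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem: A Case Study on Prostate Cancer Prognosis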


Graduate student:	蘇柏瑜 Po-Yu Su
Thesis title:	整合分群分析與粒化運算以處理資料不平衡之分類問題─以攝護腺癌症預後為例 (Integrating Clustering Analysis with Granular Computing for Imbalanced Data Classification Problem: A Case Study on Prostate Cancer Prognosis)
Advisor:	郭人介 Ren-Jieh Kuo
Committee members:	歐陽超 Chao Ou-Yang, 蔡介元 Chieh-Yuan Tsai
Degree:	Master
Department:	College of Management, Department of Industrial Management
Year of publication:	2015
Academic year of graduation:	103 (ROC calendar)
Language:	English
Number of pages:	150
Keywords:	Prognosis, prostate cancer, granular computing, dynamic clustering using particle swarm optimization, genetic algorithm K-means, artificial bee colony K-means, class imbalance, classification.

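This study applies the concept of information granulation (IG) to the classification of imbalanced data. Majority-class records with similar properties are clustered into granules, which balances the class proportions in the data set and reduces the extent to which the critical minority records are diluted by the much larger majority class. This preprocessing step enables classification algorithms to achieve better results on imbalanced data.
Information granules are constructed with three clustering methods: dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm based K-means (GA K-means), and artificial bee colony based K-means (ABC K-means). The study therefore proposes three granular computing (GrC) models for the class imbalance problem and combines them with three classifiers, a back-propagation neural network (BPN), a decision tree (DT), and a support vector machine (SVM), to build classification models. The proposed granular models were validated on benchmark data sets from the UCI repository and preprocessed the imbalanced data effectively in every case. Real data on the survival length of prostate cancer patients were then used to build a prognosis system, and the classification results improved considerably.
The results show that the proposed GrC models reduce the difficulty of classifying imbalanced data and significantly raise the classification accuracy of the minority classes as well as most of the overall classification accuracies. An effective analysis of prostate cancer prognosis can also give doctors more accurate information, helping them make better judgments about patients' survival from limited pathological data.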

This study addresses the class imbalance problem by using the concept of information granulation (IG). Data from the majority classes are assembled into granules to balance the class ratio within the data set. This process reduces the risk that critical information is diluted by large amounts of relatively unimportant data and noise.
Three clustering techniques, dynamic clustering using particle swarm optimization (DCPSO), genetic algorithm K-means (GA K-means), and artificial bee colony K-means (ABC K-means), are implemented to construct information granules. Three granular computing (GrC) models are thus proposed to solve the class imbalance problem. At the end of the procedure, classifiers are applied to construct a classification model for each data set. The effectiveness of the proposed GrC models was evaluated on benchmark data sets from the UCI Machine Learning Repository. Because the proposed models produced solid classification results, real-world data on the survival length of patients with prostate cancer were used to construct a prognosis system, and the classification results were also promising.
The results indicate that the proposed GrC models reduce the difficulty of classifying imbalanced data. Furthermore, they raise the accuracy on the minority classes and most of the overall accuracies. The computational results for prostate cancer prognosis give doctors better information and analysis of the survival conditions of patients with prostate cancer.
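The granulation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: plain Lloyd's K-means stands in for DCPSO, GA K-means, and ABC K-means, each granule is represented by its cluster centroid, and the number of granules is set to the size of the largest minority class. All three choices, along with the function names `kmeans` and `granulate_majority` and the synthetic data, are assumptions made only for illustration. The sketch also assumes at least two classes are present.

```python
import random
from collections import Counter

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means (a stand-in, not DCPSO/GA/ABC K-means)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each non-empty cluster's centroid becomes its mean.
        for c in range(k):
            members = [p for p, a in zip(points, new_assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:
            break
        assign = new_assign
    return centroids, assign

def granulate_majority(X, y, seed=0):
    """Replace majority-class records with one centroid per cluster (granule)."""
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    # Assumed granularity: as many granules as the largest minority class.
    n_granules = max(n for label, n in counts.items() if label != majority)
    major_pts = [x for x, label in zip(X, y) if label == majority]
    centroids, assign = kmeans(major_pts, n_granules, seed=seed)
    used = sorted(set(assign))  # skip clusters that ended up empty
    # Minority records are kept unchanged.
    X_new = [x for x, label in zip(X, y) if label != majority]
    y_new = [label for label in y if label != majority]
    X_new += [centroids[c] for c in used]
    y_new += [majority] * len(used)
    return X_new, y_new

# Synthetic, purely illustrative data: 40 majority (0) and 5 minority (1) records.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(4, 0.5), rng.gauss(4, 0.5)] for _ in range(5)]
y = [0] * 40 + [1] * 5

X_new, y_new = granulate_majority(X, y)
print("before:", dict(Counter(y)))
print("after: ", dict(Counter(y_new)))
```

After granulation the majority class contributes at most as many rows as the largest minority class, so any classifier trained on `X_new`, `y_new` sees a roughly balanced data set. How the thesis actually selects the granularity and represents each granule is covered in its Chapter 3 and may differ from this sketch.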

ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Research Background
1.2 Research Objectives
1.3 Research Scopes and Constraints
1.4 Framework and Organization
CHAPTER 2 LITERATURE SURVEY
2.1 Prostate Cancer
2.1.1 Data Mining in Classification of Prostate Cancer
2.1.2 Critical Factors of Prostate Cancer
2.2 Classification
2.2.1 Decision Tree
2.2.2 Artificial Neural Network
2.2.3 Support Vector Machine
2.3 Class Imbalance Problems
2.4 Cluster Analysis
2.4.1 Clustering Techniques
2.4.2 K-means Algorithm
2.4.3 Meta-heuristic-based K-means Clustering Methods
2.4.4 Automatic Clustering Methods
2.5 Granular Computing
CHAPTER 3 RESEARCH METHODOLOGY
3.1 Research Framework
3.2 Construction of Information Granules
3.2.1 Apply GA K-means in IG Process
3.2.2 Apply ABC K-means in IG Process
3.2.3 Apply DCPSO in IG Process
3.3 Selection of Granularity
3.4 Representation of Information Granules
3.5 Latent Semantic Indexing
3.6 Classification
3.6.1 Back-propagation Neural Network
3.6.2 Decision Tree
3.6.3 Support Vector Machine
CHAPTER 4 EXPERIMENTAL RESULTS
4.1 Balanced Benchmark Results and Analysis
4.1.1 Overall Computational Results of Original Data
4.1.2 Computational Results of BPN
4.1.3 Computational Results of C5.0
4.1.4 Computational Results of SVM
4.2 Imbalanced Benchmark Results and Analysis
4.2.1 Overall Computational Results of Original Data
4.2.2 Computational Results of BPN
4.2.3 Computational Results of C5.0
4.2.4 Computational Results of SVM
4.3 Statistical Hypothesis
CHAPTER 5 MODEL EVALUATION RESULTS
5.1 Data Collection
5.2 Factor Selection: Stepwise Regression
5.3 Prognosis Trial
5.3.1 Computational Results of Original Data
5.3.2 Computational Results of Proposed Models
5.4 Statistical Hypothesis
CHAPTER 6 CONCLUSION AND FUTURE RESEARCH
6.1 Conclusion
6.2 Contributions
6.3 Future Research
REFERENCE
APPENDIX
Appendix I: BPN Computational Results of Balanced Benchmark (Original Data)
Appendix II: C5.0 Computational Results of Balanced Benchmark (Original Data)
Appendix III: SVM Computational Results of Balanced Benchmark (Original Data)
Appendix IV: BPN Computational Results of Imbalanced Benchmark (Original Data)
Appendix V: C5.0 Computational Results of Imbalanced Benchmark (Original Data)
Appendix VI: SVM Computational Results of Imbalanced Benchmark (Original Data)
Appendix VII: BPN Computational Results of Balanced Benchmark (Proposed Models)
Appendix VIII: C5.0 Computational Results of Balanced Benchmark (Proposed Models)
Appendix IX: SVM Computational Results of Balanced Benchmark (Proposed Models)
Appendix X: BPN Computational Results of Imbalanced Benchmark (Proposed Models)
Appendix XI: C5.0 Computational Results of Imbalanced Benchmark (Proposed Models)
Appendix XII: SVM Computational Results of Imbalanced Benchmark (Proposed Models)
Appendix XIII: The Result of BSWD and Glass using BPN
Appendix XIV: The Result of Car Evaluation and Pima using BPN
Appendix XV: The Result of BSWD and Glass using C5.0
Appendix XVI: The Result of Car Evaluation and Pima using C5.0
Appendix XVII: The Result of BSWD and Glass using SVM
Appendix XVIII: The Result of Car Evaluation and Pima using SVM
Appendix XIX: The Result of Prostate Cancer using BPN and C5.0
Appendix XX: The Result of Prostate Cancer using SVM
Appendix XXI: SVM Parameter Setting Grid Search Figures for Imbalanced Data
Appendix XXII: SVM Parameter Setting Grid Search Figures for Prostate Cancer

                                


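Full-text release date: 2020/06/24 (campus network)
Full-text release: not authorized for public release (off-campus network)
Full-text release: not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)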
