
Graduate student: Inggi Rengganing Herani
Thesis title: Development of Carotid Artery Diagnostic Prediction Model using Hybrid Data Mining Approach (運用複合式資料探勘方法建立頸動脈病變預測模型)
Advisor: Chao Ou-Yang (歐陽超)
Oral defense committee: Ren-Jieh Kuo (郭人介), Chao-Lung Yang (楊朝龍)
Degree: Master
Department: College of Management, Department of Industrial Management
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar, 2012-2013)
Language: English
Number of pages: 56
Keywords: Carotid Artery Disease, Resampling, Imbalance Data, Feature Selection, Back Propagation Network
    Carotid artery disease is a main cause of the disability and death associated with stroke or cerebrovascular disease, and stroke is responsible for a high number of deaths worldwide. Because carotid artery disease has no symptoms, it is important to perform a medical test that uses ultrasound or another imaging method to visualize the carotid arteries. Such tests are uncomfortable, expensive, and carry some risk. To reduce the risk and the cost, this research presents a method that generates important information to help doctors diagnose carotid artery disease.
    A hybrid data mining approach is applied to produce several combined models. Real-world datasets are often imbalanced: they are dominated by normal records and contain only a small percentage of abnormal (sick) records. To overcome the imbalance, we used the Synthetic Minority Over-Sampling Technique (SMOTE) and Simple K-Means clustering. SMOTE over-samples the minority class, while clustering under-samples the majority class. Genetic Algorithm and Gain Ratio were also used to select important features; both methods select a subset of salient features and so reduce the number of features. Finally, the new dataset was processed with a Back Propagation Network (BPN), Naive Bayes, and a Decision Tree to predict the disease, and the prediction accuracy of the models was compared.
    Experimental results show that these hybrid methods achieved high accuracy, so they can assist doctors in analyzing and predicting the presence of carotid artery disease in patients.
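
    The two resampling steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, whose parameters and tooling are not given in this record. The function names `smote` and `kmeans_undersample`, the defaults `k=5` and `iters=20`, and the choice to keep the real sample closest to each centroid are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (illustrative SMOTE core).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def kmeans_undersample(majority, n_keep, iters=20, seed=0):
    """Reduce the majority class to n_keep samples (assumed variant).

    Runs plain k-means with n_keep clusters, then keeps the real sample
    closest to each centroid. Two centroids may select the same sample.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(majority, n_keep)]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in majority:
            nearest = min(range(n_keep),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return [min(majority, key=lambda p: math.dist(p, c)) for c in centroids]
```

    Calling `smote(sick_rows, n_new)` and `kmeans_undersample(normal_rows, n_keep)` on two lists of numeric feature vectors (hypothetical names) would yield a more balanced training set. Published cluster-based under-sampling methods differ in how they pick samples from each cluster, so the centroid-nearest rule here is only one possible choice.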



    Abstract
    Table of Content
    List of Figure
    List of Table
    CHAPTER I INTRODUCTION
        1.1 Background
        1.2 Purpose
        1.3 Research Structure
    CHAPTER II LITERATURE REVIEW
        2.1 Carotid Artery Disease
            2.1.1 Risk Factors and Symptoms
            2.1.2 Diagnostic Testing
        2.2 Data Mining and Knowledge Discovery
        2.3 Data Collecting and Pre-processing
        2.4 Imbalance Data Problem
            2.4.1 SMOTE
            2.4.2 K-Means Clustering
        2.5 Feature Selection Method
            2.5.1 Genetic Algorithm
            2.5.2 Gain Ratio Attribute Evaluator
        2.6 Predicting Model
            2.6.1 Back Propagation Network (BPN)
            2.6.2 Naive Bayes
            2.6.3 Decision Tree (C4.5)
    CHAPTER III RESEARCH METHODOLOGY
        3.1 Data Pre-processing
            3.1.1 Remove Outliers
            3.1.2 Data Normalization
        3.2 Dealing with Imbalance Data
            3.2.1 SMOTE
            3.2.2 K-Means Clustering
        3.3 Selecting Features Method
            3.3.1 Genetic Algorithm
            3.3.2 Gain Ratio Attribute Evaluator
        3.4 Buildup Predicting Model
            3.4.1 Back Propagation Network (BPN)
    CHAPTER IV MODEL IMPLEMENTATION
        4.1 Data Analysis
        4.2 Data Pre-processing
            4.2.1 Remove Outliers
            4.2.2 Data Normalization
        4.3 Dealing with Imbalance Data
            4.3.1 SMOTE
                4.3.1.1 Random Remove Sampling
            4.3.2 K-Means Clustering
        4.4 Selecting Features
            4.4.1 Genetic Algorithm
            4.4.2 Gain Ratio Attribute Evaluation
            4.4.3 Comparing Feature Selection
        4.5 Prediction Model
            4.5.1 Back Propagation Network (BPN)
            4.5.2 Naive Bayes
            4.5.3 Decision Tree (C4.5)
            4.5.4 Support Vector Machine
        4.6 Assessment
            4.6.1 Model Comparison
            4.6.2 Select the Best Model of BPN
            4.6.3 Model Analysis
    CHAPTER V CONCLUSION AND FUTURE RESEARCH
        1.1 Conclusion
        1.2 Future Research
    REFERENCES

