
Graduate student: 慕哈曼 (Muhammad - Rieza)
Thesis title: 發展一多階數資料探勘方法建立腦中風風險預測模型
Applying Hybrid Data Preprocessing Methods in Stroke Prediction
Advisor: 歐陽超 (Chao Ou-Yang)
Oral defense committee: 郭人介 (Ren-Jieh Kuo), 楊朝龍 (Chao-Lung Yang)
Degree: Master
Department: School of Management - Department of Industrial Management
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 62
Chinese keywords: 中風, 不平衡資料, 特徵選擇, 預測方法
English keywords: Stroke, Imbalance Data, Feature selection, Prediction Method
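    Stroke poses a serious threat to human health and has become a worldwide problem. Brain imaging examinations and ultrasound are the tools currently used to diagnose stroke. Data mining is widely used in many fields, including the medical industry, and can help doctors predict certain diseases. In this study, therefore, a multi-stage data mining method integrates imbalanced-data preprocessing, feature selection, a back propagation network, a support vector machine, and a decision tree to predict stroke.

    The brain imaging data in this study were collected from 2004 to 2011. However, the imbalance of the available data affects both prediction and feature selection. The study first rebalances the data by comparing two sampling methods, Random Under Sampling by Age and RUSBoost. Information gain and stepwise regression analysis are then used to select the important features of the balanced data. Finally, a back propagation network, a support vector machine, and a decision tree process the selected features to predict stroke. This multi-stage data mining method will help doctors provide relevant information to patients.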
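    The record does not define Random Under Sampling by Age, so the following is only a rough sketch of one plausible reading: within each age band, keep every stroke case and randomly draw at most as many non-stroke cases. The function name, the 10-year band width, and the (age, label) record layout are all assumptions made for illustration, not the thesis's implementation.

```python
import random
from collections import defaultdict


def undersample_by_age(records, age_band=10, seed=0):
    """Randomly undersample the majority class within each age band.

    records: list of (age, label) tuples, label 1 = stroke, 0 = no stroke.
    Assumes label 0 is the majority class (an illustrative assumption).
    """
    rng = random.Random(seed)

    # Group records by age band, then by class label.
    bands = defaultdict(lambda: {0: [], 1: []})
    for age, label in records:
        bands[age // age_band][label].append((age, label))

    balanced = []
    for band in sorted(bands):
        negatives = bands[band][0]
        positives = bands[band][1]
        # Keep all minority cases; draw no more majority cases than that.
        n_keep = min(len(negatives), len(positives))
        balanced.extend(positives)
        balanced.extend(rng.sample(negatives, n_keep))
    return balanced
```

    With this reading, an age band that contains no stroke cases contributes no records at all, which is a design choice of the sketch rather than something stated in the abstract.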


    Stroke has long been recognized as a major threat to health worldwide. Brain imaging examinations and ultrasound are among the tools used to detect stroke. Data mining has been widely used in many areas, including the medical industry, where it allows doctors to predict certain diseases. This research therefore proposes a hybrid model that integrates imbalanced-data preprocessing, feature selection, and three classifiers (back propagation network, support vector machine, and decision tree) for stroke prediction.
    The dataset consists of brain examination data collected from 2004 to 2011. However, the available dataset is highly imbalanced, which can affect both prediction performance and the features selected. The study first rebalances the dataset, comparing two sampling methods: Random Under Sampling by Age and RUSBoost. Next, important features of the balanced training dataset are selected by information gain and stepwise regression analysis. Finally, the selected features are processed using a Back Propagation Network, a Support Vector Machine, and a Decision Tree to predict stroke. These hybrid methods may help doctors give patients information about their likely stroke risk.
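    As a minimal illustration of the information gain criterion named above, the sketch below scores one discrete feature against a binary stroke label using the standard entropy-based definition. The function names and the toy feature values are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)

    # Split the labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)

    # Weighted average entropy of the labels within each feature value.
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

    Features would then be ranked by this score and the highest-ranked ones kept. Continuous attributes would need to be discretized first, a step this sketch leaves out.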

    ABSTRACT
    ABSTRACT (CHINESE)
    ACKNOWLEDGMENT
    TABLE OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER I INTRODUCTION
        1.1 BACKGROUND
        1.2 PURPOSES
        1.3 SCOPES AND CONSTRAINTS
        1.4 RESEARCH FRAMEWORK
    CHAPTER II LITERATURE REVIEW
        2.1 CEREBROVASCULAR DISEASE
            2.1.1 Stroke Risk Factor
        2.2 DATA MINING TECHNOLOGY
            2.2.1 Data Preprocessing
            2.2.2 Predictive Methods
    CHAPTER III METHODOLOGY
        3.1 DESIGN PHASE
        3.2 DATA PREPROCESSING
            3.2.1 Remove Outlier and Normalization
            3.2.2 Handling Imbalance Dataset
            3.2.3 Feature Selection
        3.3 PREDICTION MODEL
            3.3.1 Back Propagation Network
            3.3.2 Support Vector Machine
            3.3.3 Decision Tree
        3.4 METHOD EVALUATION
    CHAPTER IV MODEL IMPLEMENTATION
        4.1 DATA SETS
        4.2 DATA PREPROCESSING
        4.3 HANDLING IMBALANCE DATASETS
            4.3.1 Random Undersampling by Age Attribute (RUS_Age)
            4.3.2 Random Undersampling Boost (RUSBoost)
        4.4 FEATURE SELECTION
            4.4.1 Information Gain
            4.4.2 Stepwise Regression Analysis
        4.5 PREDICTION METHOD
            4.5.1 Back Propagation Network
            4.5.2 Support Vector Machine (SVM)
            4.5.3 Decision Tree
        4.6 RESULT ANALYSIS AND EVALUATION
            4.6.1 Feature Selection
            4.6.2 Prediction Evaluation
            4.6.3 Statistical Test
        4.7 DATASET COMPARISON WITH ADDITIONAL FEATURE
            4.7.1 Data sets
            4.7.2 Imbalance Dataset and Feature Selection
            4.7.3 Prediction Method
            4.7.4 Result Analysis
        4.8 BEST MODEL ANALYSIS AND DISCUSSION
    CHAPTER V CONCLUSION AND FUTURE RESEARCH
        5.1 CONCLUSION
        5.2 RESEARCH CONTRIBUTION
        5.3 FUTURE RESEARCH
    REFERENCES

