
Graduate Student: Sung-Chiang Lin (林松江)
Thesis Title: Meta-learning for Imbalanced Data and Classification Ensemble via Regression (利用後設學習及迴歸模式處理不平衡資料之分類問題)
Advisor: Wei-Ning Yang (楊維寧)
Committee Members: Yuan-chin Ivan Chang (張源俊), Yun-Shiow Chen (陳雲岫), Yung-ho Leu (呂永和), Yuh-Jye Lee (李育杰)
Degree: Doctor
Department: Department of Information Management, College of Management
Year of Publication: 2010
Graduation Academic Year: 98 (2009-2010)
Language: English
Number of Pages: 86
Keywords: receiver operating characteristic (ROC), area under the curve (AUC), imbalanced data, meta-learning, Fisher's linear discriminant
Abstract:
    In recent years, with the rapid development of network technology and the enormous growth in data volume, enterprises have sought to integrate information technology with statistical analysis in order to obtain more accurate, real-time predictions, respond to market trends, and improve business performance. Analyzing such massive data effectively and quickly, and extracting the important information it contains, has therefore become a major concern for enterprises, and it is also a central application of data mining. Among data mining techniques, statistical/machine learning theory is a crucial component: it is an important research topic in its own right and is widely applied across scientific fields. Within statistical and machine learning research, classification has long played a key role, and many methods have been developed for it, such as artificial neural networks (ANN), decision trees, Bayesian learning, and support vector machines (SVM).
    This dissertation addresses the class-imbalanced problem that arises in classification applications of statistical/machine learning theory. In such data, the number of examples in one class far exceeds that of the other classes, producing a skewed class distribution. A classification problem consists of building a model from labeled data according to its attributes and then using the model to predict the classes of new data with high overall accuracy. When applied to imbalanced data sets, however, traditional classification methods tend to predict the majority class and neglect the minority class; this yields high overall accuracy but low prediction accuracy on the minority class. Compared with the majority class, the minority class is usually the more interesting and important one (for example, rare diseases in medical diagnosis, faults in monitoring data, or fraud in credit-card screening). Under a skewed class distribution, a traditional classifier pursues high accuracy on the majority class at the expense of the minority class; in other words, its overall accuracy can be very high while its sensitivity is very low, so the truly important examples are not predicted correctly. Traditional classifiers are therefore unsuitable for class-imbalanced data.
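
    To make the accuracy-versus-sensitivity point concrete, here is a minimal numerical illustration (our own, not taken from the dissertation): on a sample with 990 majority and 10 minority examples, a degenerate classifier that always predicts the majority class reaches 99% overall accuracy yet 0% sensitivity.

        import numpy as np

        y_true = np.array([1] * 10 + [0] * 990)   # 10 minority (1), 990 majority (0)
        y_pred = np.zeros_like(y_true)            # always predict the majority class

        accuracy = np.mean(y_pred == y_true)                # 0.990
        sensitivity = np.mean(y_pred[y_true == 1] == 1)     # 0.000 (true-positive rate)
        print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}")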
    Current approaches to the class-imbalance problem mainly rebalance the data through sampling or cost-sensitive learning. Sampling techniques include over-sampling (duplicating positive examples) and under-sampling (removing negative examples); their major drawback is that they alter the distribution of the original data and thereby distort the resulting model (a brief sketch of these two baselines follows this paragraph). Cost-sensitive learning assigns different misclassification costs to positive and negative examples so as to minimize the total cost of classification; its major drawback is that users must define the misclassification cost of each target class themselves, which is infeasible when domain knowledge is lacking. This dissertation therefore proposes the MICE algorithm for classifying imbalanced data. MICE first partitions the majority class under a projection that preserves its spatial relation to the minority class, then constructs sub-classifiers from the resulting partitions and converts their outputs into probability values via logistic regression, and finally builds the ensemble model, again with logistic regression.
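
    The following sketch shows what the two sampling baselines do; the function names over_sample and under_sample are our own illustrative choices, not functions from the dissertation. Over-sampling duplicates minority rows until the classes match, while under-sampling keeps only a random majority subset of the minority's size; both change the original class distribution, which is the drawback noted above.

        import numpy as np

        rng = np.random.default_rng(0)

        def over_sample(X_min, X_maj):
            # Duplicate minority rows (sampling with replacement) up to the majority size.
            idx = rng.integers(0, len(X_min), size=len(X_maj))
            return np.vstack([X_maj, X_min[idx]])

        def under_sample(X_min, X_maj):
            # Keep a random majority subset the size of the minority group.
            idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
            return np.vstack([X_maj[idx], X_min])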
    The MICE algorithm simplifies the complicated parameter-tuning process. Experiments on both simulated and real data show that MICE achieves higher learning performance on imbalanced data and strikes a better balance between overall accuracy and sensitivity. Moreover, because MICE partitions the data under a spatial projection, ordinary linear models such as Fisher's linear discriminant can be applied directly to class-imbalanced classification problems. The results show that the proposed MICE algorithm provides an efficient solution to the classification of imbalanced data.
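
    The sketch below mirrors the three MICE steps as the abstract describes them, with stand-ins clearly noted: k-means replaces the dissertation's projection-based partition, and scikit-learn's LDA posterior plays the role of the explicit logistic transformation of sub-classifier outputs; the actual partition-under-projection and model-selection procedures are given in Chapter 3 of the dissertation.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.linear_model import LogisticRegression

        def fit_mice_like(X_maj, X_min, k=3, seed=0):
            # Step 1: partition the majority group (k-means is a stand-in for
            # the dissertation's partition under projection).
            labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X_maj)

            # Step 2: one Fisher-type linear sub-classifier per partition versus
            # the whole minority group; predict_proba supplies probability outputs.
            subs = []
            for c in range(k):
                X = np.vstack([X_maj[labels == c], X_min])
                y = np.r_[np.zeros((labels == c).sum()), np.ones(len(X_min))]
                subs.append(LinearDiscriminantAnalysis().fit(X, y))

            # Step 3: a logistic-regression ensemble stacked on the
            # sub-classifiers' minority-class probabilities (the meta-features).
            X_all = np.vstack([X_maj, X_min])
            y_all = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]
            meta = np.column_stack([s.predict_proba(X_all)[:, 1] for s in subs])
            return subs, LogisticRegression().fit(meta, y_all)

        def predict_mice_like(subs, ensemble, X):
            meta = np.column_stack([s.predict_proba(X)[:, 1] for s in subs])
            return ensemble.predict_proba(meta)[:, 1]   # minority-class probability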


English Abstract:
    In this dissertation, we focus on the application of machine learning to binary classification with a highly imbalanced data distribution. This is a very common problem when the examples of interest are relatively rare, as in bio-informatics, medical diagnosis/monitoring, and network security/intrusion detection. In other words, in an imbalanced data set the majority group accounts for a large portion of all examples, whereas the minority group accounts for only a small part. In such applications, a typical classifier tends to assign most examples to the majority group and few to the minority group. As a result, even though the classification accuracy may be very high, the classifier is incapable of recognizing the minority examples. How to locate the rare examples is therefore an important issue, and accuracy is no longer the sole appropriate measure of performance. A naïve classification algorithm will usually fail in such a situation, or will require a complicated parameter-tuning process to improve its performance. In this dissertation, we propose the "Meta Imbalanced Classification Ensemble (MICE)" algorithm for constructing a classifier ensemble based on the meta-information of linear sub-classifiers, each trained on the minority group versus one partition of the majority group, whose sample size overwhelmingly exceeds that of the minority group.
    The MICE algorithm improves performance through two key steps: partitioning with transformed features and ensembling with logistic regression. The idea of partitioning the majority group to dilute the effect of imbalance is not novel, but in MICE the majority group is partitioned on features transformed via inner products, so as to retain the geometric relation between the majority group and the minority group. For the class-imbalanced problem, average accuracy alone is not enough to express the performance of a classifier. The more appropriate measures for the class-imbalanced problem, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC), are therefore also used to evaluate performance, and the empirical results show that MICE outperforms other well-known classification methods in terms of specificity and sensitivity.
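
    A short sketch of the evaluation just described, assuming scores holds a classifier's predicted minority-class probabilities (the variable names are illustrative): scikit-learn's roc_curve and roc_auc_score give the ROC curve, where the true-positive rate is the sensitivity and one minus the false-positive rate is the specificity, and the AUC summarizes the curve in a single number.

        import numpy as np
        from sklearn.metrics import roc_auc_score, roc_curve

        y_true = np.array([1] * 10 + [0] * 990)          # imbalanced labels
        scores = np.random.default_rng(1).random(1000)   # placeholder scores

        fpr, tpr, _ = roc_curve(y_true, scores)   # sensitivity = tpr; specificity = 1 - fpr
        print(f"AUC = {roc_auc_score(y_true, scores):.3f}")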

    Contents
    1 Introduction 1
    2 Preliminaries 5
    2.1 Linear Fisher's Discriminant Analysis 5
    2.2 Support Vector Machines 6
    2.2.1 The Linear Case: Separable/Non-separable 6
    2.2.2 The Nonlinear Case with Kernel Trick 9
    2.2.3 The Class-Imbalanced Case 10
    2.3 Clustering Algorithms 10
    2.3.1 Hierarchical Clustering 11
    2.3.2 Nonhierarchical Clustering 12
    2.3.3 Two-Stage Clustering 14
    2.4 Logistic Regression 15
    3 Classification by Assembling 19
    3.1 Decomposition of Majority Group 21
    3.1.1 Partition under Projection 22
    3.2 Probability versus Function Value 24
    3.2.1 Logistic Transformation 26
    3.2.2 Explanation of Transformed Probability 27
    3.3 Final Ensemble 30
    3.3.1 Procedure of Model Selection 31
    4 Numerical Results 43
    4.1 Measurements of the Performance 43
    4.2 Compared Approaches 46
    4.3 Synthesized Data Set 47
    4.4 Some Benchmark Data 52
    5 Discussions and Future Directions 67

    References
    Ali, S. and Smith-Miles, K.A. (2006). A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing 70, 173-186.
    Alpaydin, E. (2004). Introduction to Machine Learning. U.S.A.: MIT Press.
    Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12, 387-415.
    Barandela, R., Sanchez, J.S., Garcia, V. and Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition 36, 849-851.
    Burges, C.J.C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121-167.
    Chan, P.K. and Stolfo, S.J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In Proc. Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164-168. AAAI Press.
    Chang, C.C. and Lin, C.J. (2001). LIBSVM: a library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz.
    Chang, Y.-c.I. and Lin, S.-C. (2004). Synergy of logistic regression and support vector machine in multi-class classification. In Proc. IDEAL 2004, Volume LNCS 3177, Berlin-Heidelberg, pp. 132-141. Springer.
    Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321-357.
    Chawla, N.V., Japkowicz, N. and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter 6, 1-6.
    Giraud-Carrier, C., Vilalta, R. and Brazdil, P. (2004). Introduction to the special issue on meta-learning. Machine Learning 54, 187-193.
    Cohen, G., Hilario, M., Sax, H., Hugonnet, S. and Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine 37, 7-18.
    Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning.
    Dietterich, T.G. (2000). Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, Volume 1857, pp. 1-15. Springer-Verlag.
    Dietterich, T.G. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2, 263-286.
    Dunham, M.H. (2003). Data Mining: Introductory and Advanced Topics. Upper Saddle River, N.J.: Prentice Hall/Pearson Education INC.
    Dzeroski, S. and Zenko, B. (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning 54, 255-273.
    Estabrooks, A., Jo, T. and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 20, 18-36.
    Firth, D. (1992). Bias reduction, the Jeffreys prior and GLIM. In Advances in GLIM and Statistical Modelling, edited by L. Fahrmeir, B. Francis, R. Gilchrist and G. Tutz, 91-100. New York: Springer-Verlag.
    Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80, 27-38.
    Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28, 337-407.
    Han, H., Wang, W.-Y. and Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing, 878-887.
    Heinze, G. and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21, 2409-2419.
    Imam, T., Ting, K.M. and Kamruzzaman, J. (2006). z-SVM: An SVM for improved classification of imbalanced data. In A. Sattar and B. Kang (Eds.), AI 2006: Advances in Artificial Intelligence, Volume LNAI 4304, pp. 264-273. Springer.
    Janes, H. and Pepe, M.S. (2006). The optimal ratio of cases to controls for estimating the classification accuracy of a biomarker. Biostatistics 7, 456-468.
    Japkowicz, N. (2000). Learning from imbalanced data sets: a comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets, Tech. Rep. WS-00-05. Menlo Park, CA: AAAI Press.
    Kang, P. and Cho, S. (2006). EUS SVMs: Ensemble of under-sampled SVMs for data imbalance problems. In I. King et al. (Eds.), Proc. ICONIP 2006, Volume LNCS 4232, pp. 837-846. Berlin-Heidelberg: Springer-Verlag.
    Li, C. (2007). Classifying imbalanced data using a bagging ensemble variation (BEV). In Proc. ACM-SE 45, New York, pp. 203-208. ACM.
    Lin, S.-C., Chang, Y.-c.I. and Yang, W.-N. (2009). Meta-learning for Imbalanced Data and Classification Ensemble in Binary Classification. Neurocomputing 73, 484-494.
    Lin, C.-C., Tsai, Y.-S., Lin, Y.-S., Chiu, T.-Y., Hsiung, C.-C., Lee, M.-I., Simpson, J.C. and Hsu, C.-N. (2007). Boosting multiclass learning with repeating codes and weak detectors for protein subcellular localization. Bioinformatics 23, 3374-3381.
    Liu, J., Hu, Q. and Yu, D. (2008). A weighted rough set based method developed for class imbalance learning. Information Sciences 178, 1235-1256.
    Liu, X.Y., Wu, J. and Zhou, Z.H. (2006). Exploratory under-sampling for class-imbalance learning. In Proc. the Sixth International Conference on Data Mining, Washington, pp. 965-969. IEEE Computer Society.
    Liu, X.Y. and Zhou, Z.H. (2006). The influence of class imbalance on cost-sensitive learning: An empirical study. In Proc. ICDM '06. IEEE Computer Society.
    Marrocco, C., Molinara, M. and Tortorella, F. (2005). Optimal linear combination of dichotomizers via AUC. In Proceedings of the ICML 2005 workshop on ROC Analysis in Machine Learning.
    McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models (2 ed.). New York: Chapman and Hall.
    Meir, R. and Rätsch, G. (2003). An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, edited by S. Mendelson and A. Smola, 119-184. New York: Springer-Verlag.
    Opitz, D. and Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11, 169-198.
    Orriols, A. and Bernadó-Mansilla, E. (2005). The class imbalance problem in learning classifier systems: A preliminary study. In Proc. GECCO'05. U.S.A.: ACM.
    Prodromidis, A.L., Chan, P.K. and Stolfo, S.J. (2002). Meta-learning in distributed data mining systems: Issues and approaches. In Advances in Distributed Data Mining. AAAI Press.
    Prudencio, R.B.C. and Ludermir, T.B. (2004). Meta-learning approaches to selecting time series models. Neurocomputing 61, 121-137.
    Punj, G. and Stewart, D.W. (1983). Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research 20, 134-148.
    R Development Core Team (2007). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
    Rencher, A.C. (2002). Methods of Multivariate Analysis. U.S.A.: Wiley.
    Shanahan, J.G. and Roma, N. (2003). Improving SVM text classification performance through threshold adjustment. In N. Lavrač et al. (Eds.), Proc. ECML 2003, Volume LNAI 2837, pp. 361-372. Berlin-Heidelberg: Springer.
    Sharma, S. (1996). Applied multivariate techniques. New York: Wiley.
    Su, J.Q. and Liu, J.S. (1993). Linear combinations of multiple diagnostic markers. Journal of the American Statistical Association 88, 1350-1355.
    Tao, Q., Wu, G.-W., Wang, F.-Y. and Wang, J. (2005). Posterior probability support vector machines for unbalanced data. IEEE Transactions on Neural Networks 16, 1561-1573.
    Timm, N.H. (2002). Applied Multivariate Analysis. New York: Springer.
    Ting, K.M. and Witten, I.H. (1999). Issues in stacked generalization. Journal of Artificial Intelligence Research 10, 271-289.
    Vilalta, R. and Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review 18, 77-95.
    Zhao, J.H., Li, X. and Dong, Z.Y. (2007). Online rare events detection. In Z.H. Zhou, H. Li and Q. Yang (Eds.), Advances in Knowledge Discovery and Data Mining, Volume LNAI 4426, pp. 1114-1121. Berlin-Heidelberg: Springer.
    Zhu, J. and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics 14, 185-205.
    Zhuang, L. and Dai, H. (2006). Parameter optimization of kernel-based one-class classifier on imbalance learning. Journal of Computers 1, 32-40.
