
Graduate Student: 張呈光 (Ting-Kng Tiun)
Thesis Title: 植基於主成份分析與資料類別關係之遞進式特徵提取方法
A Sequential Feature Selection Strategy Based on the Relevance Between Data Labels and Principal Components
Advisor: 楊維寧 (Wei-Ning Yang)
Committee Members: 陳雲岫 (Yun-Shiow Chen), 呂永和 (Yung-Ho Leu)
Degree: Master
Department: College of Management, Department of Information Management
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar, 2016-2017)
Language: Chinese
Pages: 24
Chinese Keywords: feature extraction, principal component analysis, correlation coefficient, area under the ROC curve, Fisher information, genetic algorithm, receiver operating characteristic (ROC) curve
English Keywords: PCA, R square, Fisher Information, Area under ROC curve
    Binary classification assigns an object to a class based on its feature vector. When the feature vector is high-dimensional, traditional classifiers run into computational difficulties, creating the need to select effective features and thereby reduce the dimensionality. For high-dimensional data there are traditionally two approaches. The first selects a subset of the original features and classifies with that subset, in the hope that the chosen subset yields good classification performance. Retaining the original features preserves their interpretability, but it also retains the correlations and dependencies among them, which makes finding a suitable subset far harder and, in the limit, forces an exhaustive search over candidate subsets. The second approach algorithmically combines all the original features into new "integrated features" and selects among these to improve classification performance. Although this approach sacrifices the interpretability of the original features, the transformation removes the dependencies among features, making the selection better founded.

    We propose a feature extraction method based on the relevance between data labels and principal components, exploiting the fact that the individual principal components produced by principal component analysis (PCA) are mutually uncorrelated. A principal component by itself carries no direct information about class membership, so we consider how to inject the label information into each component so that the components can be ranked by their discriminative power and selected accordingly. We propose three measures of relevance between the labels and a principal component, namely the coefficient of determination ($R^2$), the area under the ROC curve (AUC), and Fisher information, and compare them against the plain PCA ranking. In this study, PCA first transforms the data into mutually uncorrelated integrated features; these integrated features are then scored with $R^2$, AUC, and Fisher information, and the scores are sorted to drive the selection. We validate each selection method with three classifiers: a genetic algorithm maximizing the area under the ROC curve, a support vector machine (SVM), and naive Bayes. After training and building models with the three classifiers on the selected integrated features, we evaluate them on two data sets. The experimental results show that the proposed rankings stand out in some settings but do not beat the traditional PCA ranking in others, possibly because our understanding of the data and of the algorithms' limitations is still incomplete; we hope follow-up work will explore and extend these points.
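    To make the ranking step concrete, here is a minimal Python sketch, ours rather than the thesis's code. It assumes scikit-learn and a binary label vector; the function name `rank_components` is illustrative, and since the abstract does not spell out the Fisher-information formula, a common two-class Fisher criterion is used in its place.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def rank_components(X, y):
    """Return component indices sorted best-first under each relevance measure."""
    y = np.asarray(y)
    Z = PCA().fit_transform(X)  # mutually uncorrelated integrated features
    r2, auc, fisher = [], [], []
    for j in range(Z.shape[1]):
        z = Z[:, j]
        r2.append(np.corrcoef(z, y)[0, 1] ** 2)   # R^2 against the label
        a = roc_auc_score(y, z)                   # per-component AUC,
        auc.append(max(a, 1.0 - a))               # folded so direction is ignored
        z0, z1 = z[y == 0], z[y == 1]             # assumed Fisher criterion:
        fisher.append((z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var()))
    return {name: np.argsort(scores)[::-1]
            for name, scores in (("R2", r2), ("AUC", auc), ("Fisher", fisher))}
```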


    Binary classification predicts the class of an object based on the associated feature vector. Traditional classification methods usually suffer from the high dimensionality of the feature vector, creating the need to reduce the number of features. There are two major approaches to doing so. One selects a subset of the original features, which maintains the original meaning of each feature; however, the relevance among the original features makes it difficult to find a proper subset of significant features from a large number of candidates, forcing a resort to randomized optimization algorithms. The other approach first transforms the original features into uncorrelated integrated features by principal component analysis (PCA) and then sequentially searches for the subset of significant integrated features. The second approach removes the relevance among the integrated features, making the sequential search feasible, while losing the interpretability of the significant features.
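    As a quick illustration, not from the thesis, of why the second approach makes a sequential search feasible: the PCA scores are pairwise uncorrelated, so their sample correlation matrix is numerically the identity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))  # correlated features
Z = PCA().fit_transform(X)                               # integrated features
C = np.corrcoef(Z, rowvar=False)                         # columns as variables
print(np.allclose(C, np.eye(Z.shape[1]), atol=1e-8))     # True
```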

    In this study, we first transform the original features into uncorrelated integrated features by PCA and then rank the integrated features by their associated variances. To find the subset of significant integrated features, we start from the top-ranked integrated feature and sequentially grow the subset according to the ranks, as in the sketch below.
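    A minimal sketch of this nested-subset search, assuming the component ranking from `rank_components` above and any scikit-learn classifier; the cross-validated accuracy here is our illustrative stand-in for the thesis's training-data accuracy, and `best_nested_subset` is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def best_nested_subset(Z, y, order, clf):
    """Grow subsets in rank order; keep the size with the best CV accuracy."""
    best_k, best_acc = 1, -np.inf
    for k in range(1, len(order) + 1):
        acc = cross_val_score(clf, Z[:, order[:k]], y, cv=5).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return order[:best_k], best_acc
```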
    For each subset of integrated features, a test score, a linear combination of the integrated features, is generated for classification. The coefficient of each integrated feature in the linear combination is determined by a genetic algorithm (GA) so that the area under the receiver operating characteristic (ROC) curve of the test score is maximized. Besides this self-developed classifier, we apply two other commonly used classifiers for comparison. Using the training data, the classification accuracy of each subset is evaluated, and the subset with the largest accuracy becomes the final subset of significant integrated features used for classification. In addition to ranking the integrated features by their variances, we can also rank them by the corresponding Fisher information, $R^2$, or AUC and then sequentially grow the subset of integrated features according to the resulting ranks.
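    A compact from-scratch illustration of the GA/AUC step: a chromosome is the coefficient vector w of the test score s = Zw, and fitness is the AUC of s. The thesis's GA parameters and operators are not given in the abstract, so population size, generation count, uniform crossover, and Gaussian mutation are our guesses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ga_auc_weights(Z, y, pop=50, gens=100, sigma=0.1, seed=0):
    """Evolve the coefficients w of the test score s = Z @ w to maximize AUC."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(pop, Z.shape[1]))              # initial population
    for _ in range(gens):
        fit = np.array([roc_auc_score(y, Z @ w) for w in P])
        elite = P[np.argsort(fit)[-(pop // 2):]]        # fitter half survives
        pa = elite[rng.integers(len(elite), size=pop)]  # sample parent pairs
        pb = elite[rng.integers(len(elite), size=pop)]
        mask = rng.random((pop, Z.shape[1])) < 0.5      # uniform crossover
        P = np.where(mask, pa, pb) + rng.normal(scale=sigma, size=(pop, Z.shape[1]))
    fit = np.array([roc_auc_score(y, Z @ w) for w in P])
    return P[np.argmax(fit)]                            # best coefficient vector
```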

    Experimental results show that ranking by Fisher information can sometimes find a better subset than ranking by PCA variance alone. However, the variance ranking gives much more consistent performance and is computationally cheaper. We believe further investigation is needed into the conditions under which Fisher information or other relevance measures yield better classification performance than the PCA variance ranking.

    Table of Contents
    1 Research Background and Objectives
      1.1 Feature Selection
        1.1.1 Selection by Feature Similarity
        1.1.2 Selection by Genetic Algorithm
        1.1.3 Minimum-Redundancy Maximum-Relevance (mRMR) Feature Selection
      1.2 Feature Extraction
        1.2.1 Principal Component Analysis (PCA)
        1.2.2 Linear Discriminant Analysis (LDA)
    2 Data Sets and Research Methods
      2.1 Data Sets
        2.1.1 Diabetic Retinopathy Debrecen (DRD)
        2.1.2 Wisconsin Diagnostic Breast Cancer (WDBC)
      2.2 Data Processing Methods
        2.2.1 Principal Component Analysis (PCA)
        2.2.2 Fisher Information
        2.2.3 Feature Relevance Analysis (R^2)
        2.2.4 Area Under the ROC Curve
      2.3 Classifiers
        2.3.1 Genetic Algorithm with AUC
        2.3.2 Support Vector Machine (SVM)
        2.3.3 Naive Bayes
      2.4 Experimental Procedure
    3 Results and Discussion
      3.1 Diabetic Retinopathy Debrecen (DRD)
      3.2 Wisconsin Diagnostic Breast Cancer (WDBC)
      3.3 Conclusions
      3.4 Discussion
    References
    List of Figures: SVM Example; Accuracy of DRD using GA/AUC; Accuracy of DRD using SVM; Accuracy of DRD using Naive Bayes; Accuracy of WDBC using GA/AUC; Accuracy of WDBC using SVM; Accuracy of WDBC using Naive Bayes
    List of Tables: Diabetic Retinopathy Debrecen; Accuracy on DRD Dataset; Wisconsin Diagnostic Breast Cancer; Accuracy on WDBC Dataset; PC Ranking for DRD; PC Ranking for WDBC

