
Graduate Student: 王詔緯 (Zhao-Wei Wang)
Thesis Title: 植基於反應變數與主成份相關性之特徵選取方法
(A Feature Selection Strategy Based on the Correlation with Response Variable and Principal Components)
Advisor: 楊維寧 (Wei-Ning Yang)
Committee Members: 呂永和 (Yung-Ho Leu), 陳雲岫 (Yun-Shiow Chen)
Degree: Master
Department: Department of Information Management, School of Management
Year of Publication: 2020
Academic Year of Graduation: 108
Language: Chinese
Number of Pages: 19
Chinese Keywords: High-Dimensional Data Analysis, Principal Component Analysis, Feature Selection, Squared Correlation Coefficient, Squared Multiple Correlation, Linear Discriminant Analysis, Logistic Regression
English Keywords: High Dimensional Data Analysis, Squared Multiple Correlation
    In machine learning, high-dimensional data analysis is a challenging task. As the number of features grows, a classification model incurs a large computational cost and may fall into the curse of dimensionality and overfit. Feature selection effectively addresses these problems by retaining only the most influential features.

    This study proposes a feature selection method based on the correlation between the response variable and the principal components. Principal Component Analysis (PCA) first transforms the original, highly correlated feature vectors into mutually uncorrelated principal component vectors, the main purpose being to remove the multicollinearity caused by strong correlations among the original attribute variables.
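
    As a minimal illustration of this decorrelation step (a sketch using scikit-learn on synthetic data, not the code used in the thesis):

    # Sketch of the PCA decorrelation step (assumes scikit-learn / NumPy;
    # not the author's original code).
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    def to_principal_components(X):
        """Standardize the original (possibly highly correlated) features and
        transform them into mutually uncorrelated principal components."""
        X_std = StandardScaler().fit_transform(X)
        pca = PCA()                      # keep all components
        Z = pca.fit_transform(X_std)     # columns of Z are mutually uncorrelated
        return Z, pca

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]           # inject strong collinearity
    Z, _ = to_principal_components(X)
    print(np.round(np.corrcoef(Z, rowvar=False), 3))  # off-diagonals are ~0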

    The accuracy of a classification model can be judged from R^2, the squared multiple correlation (SMC) between the response variable and the selected features. When the principal components are ranked by their squared correlation r^2 with the response and added to the training set in order (Top 1, Top 2, Top 3, ...), the fact that the principal components are mutually uncorrelated means that R^2 is simply the running sum of the individual r^2 values. We therefore use r^2, rather than the variance used by the traditional approach, as the feature selection criterion.
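
    The ranking rule and the additivity of R^2 can be sketched as follows (illustrative only, with synthetic data and scikit-learn; the thesis uses its own datasets):

    # Sketch of the proposed ranking: score each principal component by its
    # squared correlation r^2 with the response and add PCs in that order.
    # Because the PCs are mutually uncorrelated, the model R^2 of the top-k
    # subset equals the running sum of the individual r^2 values.
    # (Illustrative only; not the thesis code.)
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))
    y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

    Z = PCA().fit_transform(StandardScaler().fit_transform(X))   # uncorrelated PCs

    r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
    order = np.argsort(r2)[::-1]                 # Top 1, Top 2, Top 3, ...

    for k in range(1, Z.shape[1] + 1):
        top_k = Z[:, order[:k]]
        R2 = LinearRegression().fit(top_k, y).score(top_k, y)    # model R^2
        print(k, round(R2, 4), round(r2[order[:k]].sum(), 4))    # the two agree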

    To evaluate the proposed feature selection method, we used Linear Discriminant Analysis (LDA) and Logistic Regression to build classification models, once with the features selected by the proposed method and once with the features selected by explained variance, and ran experiments on a wine-composition dataset and a retinal-image dataset. The results show that, compared with the traditional method, the proposed method significantly improves classification accuracy and classifier efficiency.


    In machine learning, high dimensional data analysis is a challenging task. With an enormous number of features available, a learning model may be computationally inefficient and may overfit due to the curse of dimensionality. Feature selection provides an effective way to address these problems by discarding redundant features, which improves the performance of the learning model and reduces the computational cost while maintaining accuracy.

    For classification problems, we propose a novel feature selection procedure based on the correlation between the response variable and the principal components obtained from the original features. Principal component analysis (PCA) transforms the original features into uncorrelated principal components, which are then used as classification features to avoid the multicollinearity problem.

    The accuracy of the classification model depends on the squared multiple correlation between the response variable and the set of features used in the model. When the uncorrelated principal components are selected sequentially, the squared multiple correlation is obtained by adding the squared correlations between the response variable and the selected principal components. Since a larger correlation indicates higher classification ability, the principal components are ranked by their correlations with the response variable instead of by their variances.
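
    In symbols (a restatement of the sentence above in the abstract's r^2 / R^2 notation), since the selected principal components z_1, ..., z_k are mutually uncorrelated:

    R^{2}_{y \cdot z_{1} \cdots z_{k}} = \sum_{i=1}^{k} r^{2}(y, z_{i})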

    To study the performance of the proposed feature selection procedure, we used linear discriminant analysis (LDA) and logistic regression to build the classification models. These models were then applied to the wine quality dataset and the diabetic retinopathy dataset. Experimental results demonstrate that the proposed procedure achieves higher classification accuracy than the traditional strategy, which ranks the principal components by their variances, and substantially reduces the number of features required.
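
    A sketch of this comparison under stated assumptions (scikit-learn, with synthetic data standing in for the WQ and DRD datasets; k = 5 is an arbitrary illustrative cut-off, not a value from the thesis):

    # Train LDA / logistic regression on the top-k PCs selected either by r^2
    # with the response (proposed) or by variance (traditional ordering).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    Z = PCA().fit_transform(StandardScaler().fit_transform(X))

    r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
    rank_r2 = np.argsort(r2)[::-1]      # proposed: by correlation with the response
    rank_var = np.arange(Z.shape[1])    # traditional: PCs already ordered by variance

    k = 5                               # number of PCs kept (illustrative)
    for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                      ("LR", LogisticRegression(max_iter=1000))]:
        for label, order in [("r2-ranked", rank_r2), ("variance-ranked", rank_var)]:
            acc = cross_val_score(clf, Z[:, order[:k]], y, cv=5).mean()
            print(f"{name:3s} {label:16s} top-{k}: accuracy = {acc:.3f}")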

    Table of Contents
        Abstract (Chinese)
        Abstract (English)
        Acknowledgements
        Table of Contents
        List of Figures
        List of Tables
        Chapter 1  Introduction
            1.1  Research Motivation
            1.2  Research Objectives
            1.3  Thesis Organization
        Chapter 2  Datasets and Research Methods
            2.1  Overview of Datasets
                2.1.1  Wine Quality (WQ)
                2.1.2  Diabetic Retinopathy Debrecen (DRD)
            2.2  Data Processing Methods
                2.2.1  Dimension Reduction
                2.2.2  Principal Component Analysis (PCA)
            2.3  Classification Algorithms
                2.3.1  Linear Discriminant Analysis (LDA)
                2.3.2  Logistic Regression (LR)
        Chapter 3  Experimental Procedure and Results
            3.1  Wine Quality (WQ) Results
            3.2  Diabetic Retinopathy Debrecen (DRD) Results
        Chapter 4  Conclusions and Future Work
            4.1  Conclusions
            4.2  Future Work
        References
    List of Figures
        Figure 1  Thesis organization
        Figure 2  WQ-LDA experimental results
        Figure 3  WQ-LR experimental results
        Figure 4  DRD-LDA experimental results
        Figure 5  DRD-LR experimental results
    List of Tables
        Table 1  Wine Quality
        Table 2  Diabetic Retinopathy Debrecen
        Table 3  PC Ranking for WQ
        Table 4  PC Ranking for DRD


    Full-Text Release Date: 2023/06/22 (campus network)
    Full-Text Release Date: 2025/06/22 (off-campus network)
    Full-Text Release Date: 2025/06/22 (National Central Library: Taiwan NDLTD system)