
Graduate Student: Chia-Wen Tsai (蔡嘉文)
Thesis Title: A Feature Selection Strategy of Polynomial Kernel Function Based on the Correlation between Response Variable and Principal Components (植基於反應變數與主成份相關性之多項式核函數特徵選取方法)
Advisor: Wei-Ning Yang (楊維寧)
Committee Members: Yung-ho Leu (呂永和), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: College of Management - Department of Information Management
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 20
Keywords: Kernel Function, Polynomial Kernel, Non-linearly Separable Data, Dimensional Expansion, Principal Components Analysis, Feature Selection, Coefficient of Determination, Logistic Regression
    In the field of machine learning, low-dimensional, linearly non-separable data has always posed a challenging task. To overcome this difficulty, a kernel function can be applied to map the feature vectors from a low-dimensional space into a high-dimensional space. Expanding the feature vectors, however, may introduce multicollinearity among the resulting features, so Principal Component Analysis (PCA) is applied to resolve the multicollinearity problem.

    This study uses a polynomial kernel function to increase the dimensionality of the feature vectors and then applies PCA to transform them into mutually uncorrelated principal components. In contrast to traditional PCA, the proposed method ranks the principal components by their correlation with the response variable rather than by their variance; the ranked components then serve as candidate features for the classification model.

    The accuracy of the classification model can be assessed through the coefficient of determination R^2 between the response variable and the selected features. When the principal components are ranked by their squared correlation r^2 with the response variable and added to the training set in the order Top_1, Top_2, Top_3, ..., the individual r^2 values can simply be summed to obtain R^2, because the principal components are mutually uncorrelated. This is why we use r^2, rather than the variance used in the traditional approach, as the feature-selection criterion.

    Using logistic regression as the classification model together with the proposed feature selection method, experiments were conducted on a banknote authentication dataset and a retinal image dataset. The results show that the proposed feature selection method yields a significant improvement in classification accuracy over the traditional kernel PCA approach.


    In machine learning, dealing with low-dimensional, non-linearly separable data is a challenging task. One way to alleviate this difficulty is to apply a kernel function that maps the feature vector from a low-dimensional space into a high-dimensional space. However, increasing the number of feature variables may introduce multicollinearity among them. Principal Components Analysis (PCA) is then adopted to resolve the resulting multicollinearity problem.
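    As a concrete illustration of these two steps, the following sketch (assuming scikit-learn; the data are synthetic and the variable names are chosen for illustration only) expands a low-dimensional feature matrix with an explicit degree-2 polynomial map and then decorrelates the expanded features with PCA:

        # Minimal sketch, assuming scikit-learn; synthetic data stand in for a real dataset.
        import numpy as np
        from sklearn.preprocessing import PolynomialFeatures, StandardScaler
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 4))                    # low-dimensional feature vectors

        # Dimensional expansion: explicit degree-2 polynomial feature map
        X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
        X_poly = StandardScaler().fit_transform(X_poly)  # standardize before PCA

        # PCA removes the multicollinearity among the expanded features
        Z = PCA().fit_transform(X_poly)                  # columns of Z are mutually uncorrelated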

    In this research, a polynomial kernel function is used to increase the dimension of the feature vector, and PCA is then applied to generate uncorrelated principal components. In contrast to traditional PCA, the principal components are ranked by their correlation with the response variable rather than by their variance, and the ranked components serve as candidate features for the classification model.
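    The ranking step can be sketched as follows, reusing the component matrix Z and a binary response vector y from the sketch above (the helper name rank_by_r2 is hypothetical, not part of the thesis):

        # Rank principal components by squared correlation with the response,
        # instead of by explained variance as in traditional PCA.
        import numpy as np

        def rank_by_r2(Z, y):
            r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
            order = np.argsort(r2)[::-1]   # component indices, largest r^2 first
            return order, r2[order]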

    The accuracy of the classification model depends on the coefficient of determination (R^2), i.e., the squared correlation between the response variable and the selected features. We rank the principal components by their squared correlation with the response variable (r^2) and gradually add Top_1, Top_2, Top_3, ... to the training set. Since the principal components are mutually uncorrelated, the individual r^2 values can be summed to obtain R^2. We therefore use r^2, rather than the variance used in traditional principal component analysis, as the feature-selection criterion.
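    In symbols, under the linear-regression view of R^2, ranking the components by their squared correlations r_{(1)}^2 >= r_{(2)}^2 >= ... and regressing the response on the top k of them gives

        r_j^2 = \operatorname{corr}(Z_j, y)^2, \qquad R_k^2 = \sum_{j=1}^{k} r_{(j)}^2,

    because mutually uncorrelated regressors share no explained variance, so each added component contributes exactly its own r^2 to the coefficient of determination.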

    A logistic regression model is used as the classifier in combination with the proposed feature selection scheme. The proposed method is applied to the Banknote Authentication dataset and the Diabetic Retinopathy Debrecen dataset for experimentation. Experimental results show that the proposed method achieves higher classification accuracy than the traditional kernel PCA approach.
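    For concreteness, a sketch of the complete pipeline on the Banknote Authentication data, assuming scikit-learn and a local copy of the UCI file (the file name, degree-2 kernel, and 70/30 split are illustrative assumptions, not necessarily the exact experimental setup of the thesis):

        # End-to-end sketch: polynomial expansion -> PCA -> r^2 ranking -> logistic regression.
        import numpy as np
        from sklearn.preprocessing import PolynomialFeatures, StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        data = np.loadtxt("data_banknote_authentication.txt", delimiter=",")  # assumed local file
        X, y = data[:, :-1], data[:, -1]

        X_poly = StandardScaler().fit_transform(
            PolynomialFeatures(degree=2, include_bias=False).fit_transform(X))
        Z = PCA().fit_transform(X_poly)

        # Rank components by squared correlation with the response (computed on the
        # full data here for brevity; a careful study would use the training split only).
        r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
        order = np.argsort(r2)[::-1]

        Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.3, random_state=0)
        for k in range(1, Z.shape[1] + 1):   # add Top_1, Top_2, Top_3, ... incrementally
            cols = order[:k]
            clf = LogisticRegression(max_iter=1000).fit(Z_tr[:, cols], y_tr)
            print(k, clf.score(Z_te[:, cols], y_te))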

    Table of Contents:
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Research Objectives
      1.3  Thesis Organization
    Chapter 2  Datasets and Research Methods
      2.1  Dataset Overview
        2.1.1  Banknote Authentication (Bank)
        2.1.2  Diabetic Retinopathy Debrecen (DRD)
      2.2  Data Processing Methods
        2.2.1  Kernel Function
          2.2.1.1  Polynomial Kernel
        2.2.2  Principal Components Analysis (PCA)
        2.2.3  Kernel Principal Components Analysis (KPCA)
      2.3  Classification Algorithm
        2.3.1  Logistic Regression
    Chapter 3  Experimental Procedure and Results
      3.1  Experimental Procedure
        3.1.1  Method 1: PCA applied to the classification problem
        3.1.2  Method 2: Kernel PCA
        3.1.3  Method 3: Combining Kernel PCA and PCA for the classification problem
      3.2  Experimental Results
        3.2.1  Banknote Authentication (Bank) results
        3.2.2  Diabetic Retinopathy Debrecen (DRD) results
    Chapter 4  Conclusions and Future Work
      4.1  Conclusions
      4.2  Future Work
    References

    Full-text release date: 2024/07/06 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan NDLTD system)