
Graduate Student: YU-SIN HU (胡毓芯)
Thesis Title: Linear Discriminant Analysis Based on the Correlations between Principal Components and Response Variable
Advisors: Cheng-Huang Hung (洪政煌), Wei-Ning Yang (楊維寧)
Oral Defense Committee: Cheng-Huang Hung (洪政煌), Wei-Ning Yang (楊維寧), 呂永和
Degree: Master
Department: School of Management, Department of Information Management
Year of Publication: 2024
Graduation Academic Year: 112
Language: Chinese
Number of Pages: 13
Keywords: Principal Component Analysis, Feature Selection, Coefficient of Determination, Linear Discriminant Analysis



In the field of machine learning, when the data to be classified are voluminous and carry many feature attributes, a classification model incurs a heavy computational cost and easily falls into the "curse of dimensionality". Principal Component Analysis can effectively reduce the dimensionality, and how to select effective features while lowering the computational cost has therefore become an important problem.
In this study, we propose a feature selection method based on the correlations between the principal components and the response variable. Principal Component Analysis (PCA) first transforms the attribute vectors of the original, possibly correlated attribute variables into mutually uncorrelated principal component vectors, eliminating the multicollinearity that arises when the attribute variables are highly correlated.
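This decorrelation step can be sketched as follows (a minimal illustration on toy data of our own, not the thesis code): the principal component scores of centered data have a diagonal covariance matrix, so the multicollinearity among the original attributes disappears.

```python
import numpy as np

def pca_scores(X):
    """Return the PC scores of X; their columns are mutually uncorrelated."""
    Xc = X - X.mean(axis=0)                      # center each attribute variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T                             # project onto the PC directions

# Toy data with strong multicollinearity: the second attribute
# is almost an exact copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
Z = pca_scores(X)
C = np.cov(Z, rowvar=False)                      # diagonal up to rounding error
```

The off-diagonal entries of `C` are zero up to floating-point error, which is exactly the property the method relies on when it later treats the components independently.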
The squared correlation coefficient r^2 between the response variable and each selected feature is used to gauge its contribution to the accuracy of the classification model. The principal components are sorted by r^2 from largest to smallest and added one by one to the training set so as to maximize classifier accuracy. We therefore adopt r^2 as the feature selection criterion, rather than the variance criterion used in traditional principal component analysis.
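The ranking step can be sketched like this (a hypothetical toy setup, not the thesis experiments): compute r^2 between each principal component and the response, then order the components by r^2 in descending order.

```python
import numpy as np

def r2_ranking(Z, y):
    """Rank the PC columns of Z by squared correlation with the response y."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    r = (Zc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Zc ** 2).sum(axis=0) * (yc ** 2).sum())
    r2 = r ** 2
    return np.argsort(-r2), r2                   # largest r^2 first

# Toy PCs: only the second component (index 1) carries the signal.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 3))
y = 2.0 * Z[:, 1] + 0.1 * rng.normal(size=100)
order, r2 = r2_ranking(Z, y)
# Components would be added to the training set in `order`,
# stopping once classifier accuracy no longer improves.
```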
To evaluate the proposed method, Linear Discriminant Analysis (LDA) was used to build classification models for both the proposed r^2-based criterion and the traditional variance-based criterion, and experiments were conducted on the Diabetic Retinopathy Debrecen and Breast Cancer Wisconsin datasets. The results show that the r^2-based criterion outperforms the variance-based one: it reduces the number of features used, retains the principal components that discriminate well, and discards those that do not, thereby improving both the accuracy and the efficiency of the classifier.
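The comparison can be illustrated with a small synthetic experiment (our own toy data and a simplified two-class Fisher discriminant standing in for the thesis's LDA setup): when the most discriminative principal component has low variance, the r^2 ranking finds it while the variance ranking does not.

```python
import numpy as np

def lda_fit_predict(Xtr, ytr, Xte):
    """Two-class Fisher discriminant: project onto w = Sw^{-1}(m1 - m0)."""
    X0, X1 = Xtr[ytr == 0], Xtr[ytr == 1]
    Sw = np.atleast_2d(np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
    thresh = 0.5 * ((X0 @ w).mean() + (X1 @ w).mean())
    return ((Xte @ w) > thresh).astype(int)

# Synthetic "principal components": the highest-variance PC is pure
# noise, while the lowest-variance PC separates the two classes.
rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, n)
Z = np.column_stack([
    10.0 * rng.normal(size=n),                    # large variance, no signal
    5.0 * rng.normal(size=n),                     # medium variance, no signal
    2.0 * (y - 0.5) + 0.3 * rng.normal(size=n),   # small variance, strong signal
])

r = np.array([np.corrcoef(Z[:, j], y)[0, 1] for j in range(3)])
by_r2 = np.argsort(-r ** 2)             # proposed ordering
by_var = np.argsort(-Z.var(axis=0))     # traditional ordering

tr, te = np.arange(300), np.arange(300, n)

def accuracy(cols):
    pred = lda_fit_predict(Z[np.ix_(tr, cols)], y[tr], Z[np.ix_(te, cols)])
    return (pred == y[te]).mean()

acc_r2 = accuracy(by_r2[:1])            # keep one PC chosen by r^2
acc_var = accuracy(by_var[:1])          # keep one PC chosen by variance
```

Here `acc_r2` is near perfect while `acc_var` hovers around chance, mirroring the qualitative conclusion of the thesis on a toy scale.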

Abstract (Chinese) III
Abstract (English) IV
Acknowledgements V
List of Figures VII
List of Tables VII
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Objectives 1
1.3 Thesis Organization 2
Chapter 2 Datasets and Research Methods 3
2.1 Dataset Overview 3
2.1.1 Diabetic Retinopathy Debrecen 3
2.1.2 Breast Cancer Wisconsin 3
2.2 Data Processing Methods 4
2.2.1 Dimension Reduction 4
2.2.2 Principal Component Analysis (PCA) 4
2.3 Classification Algorithm 5
2.3.1 Linear Discriminant Analysis (LDA) 5
Chapter 3 Experimental Procedure and Results 6
3.1 Experimental Procedure 6
3.2 Experimental Results 7
3.2.1 Results on Diabetic Retinopathy Debrecen 7
3.2.2 Results on Breast Cancer Wisconsin 9
Chapter 4 Conclusions and Future Work 11
4.1 Conclusions 11
4.2 Future Work 11
References 13

[1] Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, 2020.
[2] Kuo, Frances Y., and Ian H. Sloan. "Lifting the curse of dimensionality." Notices of the AMS 52.11 (2005): 1320-1328.
[3] Daoud, Jamal I. "Multicollinearity and regression analysis." Journal of Physics: Conference Series. Vol. 949. No. 1. IOP Publishing, 2017.
[4] Jolliffe, Ian T., and Jorge Cadima. "Principal component analysis: a review and recent developments." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374.2065 (2016): 20150202.
[5] Antal, Balint, and Andras Hajdu. (2014). Diabetic Retinopathy Debrecen. UCI Machine Learning Repository. https://doi.org/10.24432/C5XP4P.
[6] Wolberg, William, Olvi Mangasarian, Nick Street, and W. Street. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B.
[7] Van Der Maaten, L., Postma, E. O., & van den Herik, H. J. (2009). Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10(66-71), 13.
[8] Jović, Alan, Karla Brkić, and Nikola Bogunović. "A review of feature selection methods with applications." 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 2015.
[9] Khalid, S., Khalil, T., & Nasreen, S. (2014, August). A survey of feature selection and feature extraction techniques in machine learning. In 2014 Science and Information Conference (pp. 372-378). IEEE.
[10] Abdi, Hervé, and Lynne J. Williams. "Principal component analysis." Wiley Interdisciplinary Reviews: Computational Statistics 2.4 (2010): 433-459.
[11] Balakrishnama, Suresh, and Aravind Ganapathiraju. "Linear discriminant analysis-a brief tutorial." Institute for Signal and Information Processing 18.1998 (1998): 1-8.
[12] Prince, Simon JD, and James H. Elder. "Probabilistic linear discriminant analysis for inferences about identity." 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.
