
Author: Ming-Lin Kang (康銘麟)
Thesis title: Naïve Bayes Classifier Based on the Correlations between Response Variable and Principal Components (植基於反應變數與主成分相關性之簡單貝氏分類法)
Advisor: Wei-Ning Yang (楊維寧)
Oral examination committee: Yung-Ho Leu (呂永和), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: School of Management - Department of Information Management
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 21
Keywords (Chinese): 主成分分析, 特徵選取, 複相關係數, 決定係數, 簡單貝氏分類器, AUC
Keywords (English): Principal Component Analysis, Feature Selection, Squared Multiple Correlation, Coefficient of Determination, Naïve Bayes Classifier, Area Under the Receiver Operating Characteristic Curve (AUC)
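    Because the virus behind COVID-19 has spread worldwide in recent years, this study addresses the classification of medical datasets. As the number of features grows, a classification model incurs a heavy computational cost and may run into the curse of dimensionality. Principal component analysis (PCA) provides effective dimension reduction and mitigates both problems.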

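    This study proposes a feature selection method based on the correlation between the principal components and the response variable. PCA transforms the attribute vectors of the original, highly correlated attribute variables into principal component vectors that are pairwise uncorrelated. The aim is to eliminate the multicollinearity that arises when the attribute variables are too strongly correlated.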

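    The accuracy of a classification model can be gauged by the squared multiple correlation (SMC) R^2 between the selected features and the response variable. The principal components are first sorted by their squared correlation coefficient r^2 with the response variable and are then added to the dataset step by step in the order Top 1, Top 2, Top 3, and so on. Because the principal components are mutually uncorrelated, R^2 can be obtained by summing the individual r^2 values. We therefore use r^2 as the feature selection criterion, rather than the variance used in traditional PCA.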
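    The additivity of r^2 stated above follows from a short least-squares argument. The derivation below is a sketch in notation introduced here (not taken from the thesis): y is the centered response vector, z_1, ..., z_k are centered principal component score vectors with z_i^T z_j = 0 for i ≠ j, and R^2 is the coefficient of determination of the linear regression of y on z_1, ..., z_k.

```latex
% Assumed notation: y and z_1..z_k are centered n-vectors,
% and z_i^T z_j = 0 for i != j (uncorrelated component scores).
\begin{aligned}
\hat{\beta}_j &= \frac{z_j^{\top} y}{z_j^{\top} z_j}, \qquad
\hat{y} = \sum_{j=1}^{k} \hat{\beta}_j z_j, \\
R^2 &= \frac{\hat{y}^{\top}\hat{y}}{y^{\top} y}
     = \sum_{j=1}^{k} \frac{(z_j^{\top} y)^2}{(z_j^{\top} z_j)\,(y^{\top} y)}
     = \sum_{j=1}^{k} r_j^2 .
\end{aligned}
```

    The cross terms in \hat{y}^T \hat{y} vanish only because the scores are orthogonal; for correlated attributes the individual r^2 values would not add up to R^2.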

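    To evaluate the proposed feature selection method, this study builds Naïve Bayes classifiers on the features chosen by three schemes: the proposed method, selection by variance, and selection by the area under the receiver operating characteristic curve (AUC). The models are validated on a heart disease medical dataset. The experimental results show that the proposed method outperforms both the variance-based and the AUC-based selection methods: it effectively reduces the number of features used and significantly improves classification accuracy.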


    Classification methods classify objects by exploring the relationship between the class and the features of the object.

    Because the number of features is often large, classification becomes computationally intensive, so the number of features needs to be reduced.

    Principal component analysis (PCA) transforms correlated features into uncorrelated principal components. A subset of significant principal components is then used for classification.
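    As a minimal illustration of how PCA decorrelates features, the sketch below handles only the two-feature case, where the principal axes can be found in closed form by rotating the centered data. The function names `pca_2d` and `covariance` and the toy numbers are assumptions made for this example; the thesis does not specify its implementation, and features are not standardized here.

```python
import math

def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def pca_2d(xs, ys):
    """Return the principal component scores (pc1, pc2) of two features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]  # centered first feature
    cy = [y - my for y in ys]  # centered second feature
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Rotation angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]   # direction of largest variance
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]  # orthogonal direction
    return pc1, pc2

# Toy data (made up): two strongly correlated features.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2 = pca_2d(xs, ys)
print("cov(x, y)     =", covariance(xs, ys))
print("cov(pc1, pc2) =", covariance(pc1, pc2))  # zero up to rounding error
```

    For more than two features the same idea requires an eigendecomposition of the full covariance matrix, which a numerical library would normally provide.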

    Traditional PCA ranks principal components by variance, without considering the relationship between each individual principal component and the class variable.

    A selection scheme based on the squared correlation between each individual principal component and the class variable is proposed to select significant principal components for classification.
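    The ranking step can be sketched as follows, assuming the component scores are already available as columns and the class variable is coded 0/1. The helper names `pearson_r` and `rank_by_squared_correlation` and the toy data are hypothetical, chosen so that the lowest-variance component is the one most correlated with the class.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def rank_by_squared_correlation(components, labels):
    """Return [(index, r2, cumulative_r2)] sorted by r2, largest first."""
    scored = [(j, pearson_r(col, labels) ** 2) for j, col in enumerate(components)]
    scored.sort(key=lambda item: item[1], reverse=True)
    ranked, total = [], 0.0
    for j, r2 in scored:
        total += r2  # a valid running R^2 only if the columns are uncorrelated
        ranked.append((j, r2, total))
    return ranked

# Toy component scores (made up): mutually orthogonal, centered columns,
# listed in decreasing order of variance as traditional PCA would rank them.
components = [
    [3, 3, 3, 3, -3, -3, -3, -3],
    [2, 2, -2, -2, 2, 2, -2, -2],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
labels = [1, 0, 1, 0, 1, 0, 1, 1]

for index, r2, cumulative in rank_by_squared_correlation(components, labels):
    print(f"component {index}: r^2 = {r2:.4f}, cumulative R^2 = {cumulative:.4f}")
```

    In this toy example the third column has the smallest variance but the largest r^2, so a variance-based ranking would select it last while the correlation-based ranking selects it first.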

    Another selection scheme, which ranks principal components by the area under the receiver operating characteristic curve (AUC) of each individual principal component, is also investigated.
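    A per-component AUC can be computed without drawing the ROC curve, using its equivalence to the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The sketch below assumes labels coded 0/1. Folding AUC values below 0.5 to 1 - AUC (so that a component separating the classes in the reverse direction still ranks highly) is an assumption of this example, not something the abstract states; the function names are hypothetical.

```python
def auc_single_feature(scores, labels):
    """AUC of one feature: P(positive > negative) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def rank_by_auc(components, labels):
    """Return [(index, folded_auc)] sorted by folded AUC, largest first."""
    scored = []
    for j, col in enumerate(components):
        auc = auc_single_feature(col, labels)
        scored.append((j, max(auc, 1.0 - auc)))  # assumption: direction is ignored
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Toy data (made up): three candidate components and a binary class.
components = [
    [0.1, 0.4, 0.35, 0.8],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 2.0, 1.0, 2.0],
]
labels = [0, 0, 1, 1]
print(rank_by_auc(components, labels))
```

    The pairwise comparison is quadratic in the number of samples; it is used here only because it makes the definition explicit.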

    The selection schemes are evaluated with a Naïve Bayes classifier on a heart disease dataset. Empirical results show that the proposed selection scheme based on squared correlations outperforms the traditional selection scheme based on variances.
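    To make the classification step concrete, the sketch below fits a Naïve Bayes classifier to a handful of selected component scores. It assumes Gaussian class-conditional densities for each feature; the abstract does not say which variant the thesis uses (a discretized version is equally plausible), and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    model = {}
    for y, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance positive for constant features.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        model[y] = (n / len(rows), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    best, best_score = None, -math.inf
    for y, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Features are treated as independent given the class.
            score += -0.5 * math.log(2.0 * math.pi * var) - (v - m) ** 2 / (2.0 * var)
        if score > best_score:
            best, best_score = y, score
    return best

# Toy data (made up): two selected component scores per sample.
rows = [[-2.0, 0.1], [-1.5, -0.2], [-1.0, 0.3],
        [1.0, -0.1], [1.5, 0.2], [2.0, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [-1.2, 0.0]), predict_gaussian_nb(model, [1.7, 0.1]))
```

    The conditional-independence assumption of Naïve Bayes pairs naturally with principal components, since they are uncorrelated by construction, although uncorrelated does not in general mean independent.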

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Motivation
      1.2 Research Objectives
      1.3 Thesis Organization
    Chapter 2 Dataset and Research Methods
      2.1 Dataset Overview
        2.1.1 Heart Disease (HD)
      2.2 Data Processing Methods
        2.2.1 Dimension Reduction
        2.2.2 Principal Components Analysis (PCA)
        2.2.3 Area Under the Receiver Operating Characteristic Curve (AUC)
      2.3 Classification Algorithm
        2.3.1 Naïve Bayes Classifier (NB)
    Chapter 3 Experimental Procedure and Results
      3.1 Heart Disease (HD) Experimental Results
    Chapter 4 Conclusions and Future Work
      4.1 Conclusions
      4.2 Future Work
    References

