
Graduate Student: 蔡忠諺 (Chung-Yen Tsai)
Thesis Title: 貝氏分類器特徵選取之研究 (Feature Selection for A Bayes’ Classifier)
Advisor: 楊維寧 (Wei-Ning Yang)
Committee Members: 呂永和 (Yung-Ho Leu), 陳雲岫 (Yun-Shiow Chen)
Degree: Master
Department: Department of Information Management, School of Management
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 20
Chinese Keywords: 特徵選取、主成分分析、相關係數平方、信息增益、高斯樸素貝氏分類器、多元高斯貝氏分類器
English Keywords: Feature Selection, Principal Component Analysis, Squared Correlation, Information Gain, Gaussian Naive Bayes, Multivariate Gaussian Bayes
Using uninformative features in classification may degrade a model's performance, so feature selection methods based on the squared correlation between each feature and the class variable, and on the information gain of each individual feature, are used to decide which features to retain. Because the selected features are usually correlated with one another, simply selecting features one at a time is not adequate. We therefore first apply principal component analysis to transform the features into uncorrelated principal components, rank the principal components by their squared correlation with the class variable, and then add the components in that order until the classification model reaches its highest accuracy. A Gaussian naive Bayes classifier and a multivariate Gaussian Bayes classifier are used to study the resulting feature-selection combinations. Empirical results show that the proposed feature selection schemes significantly improve classification accuracy.
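The two feature-scoring criteria described above (squared correlation with the class and per-feature information gain) can be sketched as follows. This is a minimal illustration rather than the thesis code: `X` and `y` are placeholders for a feature matrix and binary class labels, and scikit-learn's `mutual_info_classif` is assumed as a stand-in estimator for the information-gain score.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def squared_correlation(X, y):
    """Squared Pearson correlation (r^2) between each feature column and the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])

def information_gain(X, y):
    """Per-feature mutual information with the class, used here as the
    information-gain score (scikit-learn's k-NN estimator, an assumption,
    rather than the discretization used in the thesis)."""
    return mutual_info_classif(X, y, random_state=0)

# Rank features from most to least informative under either criterion:
# ranking_r2 = np.argsort(squared_correlation(X, y))[::-1]
# ranking_ig = np.argsort(information_gain(X, y))[::-1]
```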


Using noninformative features may degrade the performance of a classification model. Feature selection schemes based on the squared correlation between each feature and the class, and on the information gain of each individual feature, are used to determine the informative features for the classification models. Since the selected informative features are often correlated, which makes sequentially selecting features infeasible, principal component analysis is first used to generate uncorrelated principal components. The principal components are then ranked by their squared correlations with the class, and are sequentially selected for the classification models to achieve the maximum accuracy. A Gaussian naive Bayes’ classification model and a multivariate Gaussian Bayes’ classification model are used to study the feature selection schemes. Empirical results show that the proposed feature selection schemes achieve substantial improvements in classification accuracy.
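The overall pipeline (PCA, ranking of components by squared correlation with the class, then sequential selection) could be sketched as below. This is a hedged sketch under assumptions, not the thesis implementation: `X` and `y` are placeholders for one of the data sets, `GaussianNB` stands in for the Gaussian naive Bayes’ classifier, `QuadraticDiscriminantAnalysis` stands in for the multivariate Gaussian Bayes’ classifier (class-specific mean vectors with full covariance matrices), and five-fold cross-validated accuracy is an assumed evaluation choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def select_components(X, y, model):
    """Rank PCA components by r^2 with the class, then add them one at a time
    and keep the prefix that maximizes cross-validated accuracy."""
    Z = PCA().fit_transform(X)  # uncorrelated principal components
    y = np.asarray(y, dtype=float)
    r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
    order = np.argsort(r2)[::-1]  # most informative components first
    best_acc, best_k = 0.0, 0
    for k in range(1, Z.shape[1] + 1):
        acc = cross_val_score(model, Z[:, order[:k]], y.astype(int), cv=5).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k], best_acc

# Example usage with both classifiers (X, y are placeholders for a data set
# such as Diabetic Retinopathy Debrecen or Breast Cancer Wisconsin Diagnostic):
# comps_nb,  acc_nb  = select_components(X, y, GaussianNB())
# comps_mvg, acc_mvg = select_components(X, y, QuadraticDiscriminantAnalysis())
```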

Table of Contents
Abstract (Chinese); Abstract (English); Acknowledgements; Table of Contents; List of Figures; List of Tables
Chapter 1  Introduction
  1.1 Research Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2  Data Sets and Research Methods
  2.1 Data Set Overview
    2.1.1 Diabetic Retinopathy Debrecen (DRD)
    2.1.2 Breast Cancer Wisconsin Diagnostic
    2.1.3 Simulation Data Set
  2.2 Data Processing Methods
    2.2.1 Feature Selection
      2.2.1.1 r^2
      2.2.1.2 Information Gain
    2.2.2 Principal Component Analysis (PCA)
  2.3 Classification Algorithms
    2.3.1 Gaussian Naive Bayes’ Classifier
    2.3.2 Multivariate Gaussian Bayes’ Classifier
Chapter 3  Experimental Procedure and Results
  3.1 Experimental Procedure
    3.1.1 Method 1: Classification with PCA
    3.1.2 Method 2: Classification Combining Feature Selection with PCA
  3.2 Experimental Results
    3.2.1 Diabetic Retinopathy Debrecen Results
    3.2.2 Breast Cancer Wisconsin Diagnostic Results
    3.2.3 Simulation Data Results
Chapter 4  Conclusions and Future Work
  4.1 Conclusions
  4.2 Future Work
References

