| Graduate Student | 蔡忠諺 Chung-Yen Tsai |
|---|---|
| Thesis Title | 貝氏分類器特徵選取之研究 (Feature Selection for a Bayes' Classifier) |
| Advisor | 楊維寧 Wei-Ning Yang |
| Thesis Committee | 呂永和 Yung-Ho Leu, 陳雲岫 Yun-Shiow Chen |
| Degree | Master (碩士) |
| Department | 管理學院 - 資訊管理系 Department of Information Management |
| Publication Year | 2022 |
| Graduation Academic Year | 110 (ROC calendar) |
| Language | Chinese |
| Pages | 20 |
| Keywords (Chinese) | 特徵選取、主成分分析、相關係數平方、信息增益、高斯樸素貝氏分類器、多元高斯貝氏分類器 |
| Keywords (English) | Feature Selection, Principal Component Analysis, Squared Correlation, Information Gain, Gaussian naive Bayes, Multivariate Gaussian Bayes |
Using non-informative features in classification may degrade the performance of a classification model. Feature selection schemes based on the squared correlation between each feature and the class, and on the information gain of each individual feature, are used to determine which informative features to retain. Because the selected informative features are often correlated, sequentially selecting features is not feasible; principal component analysis is therefore first applied to transform the features into uncorrelated principal components. The principal components are then ranked by their squared correlation with the class and selected sequentially until the classification model reaches its maximum accuracy. A Gaussian naive Bayes classifier and a multivariate Gaussian Bayes classifier are used to study the resulting feature selection schemes. Empirical results show that the proposed schemes achieve substantial improvements in classification accuracy.
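The pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the synthetic dataset, the 5-fold cross-validated accuracy criterion, and the greedy search over a ranked prefix of components are assumptions made for the example.

```python
# Sketch of the proposed scheme: decorrelate features with PCA, rank the
# principal components by their squared correlation with the class label,
# then sequentially add components and keep the count that maximizes the
# accuracy of a Gaussian naive Bayes classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def select_components(X, y, cv=5):
    # Transform the (possibly correlated) features into uncorrelated PCs.
    Z = PCA().fit_transform(X)
    # Squared Pearson correlation of each component with the class label.
    r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2
                   for j in range(Z.shape[1])])
    order = np.argsort(r2)[::-1]  # most informative components first
    best_acc, best_k = -np.inf, 0
    # Sequentially grow the component set; retain the best-scoring prefix.
    for k in range(1, len(order) + 1):
        acc = cross_val_score(GaussianNB(), Z[:, order[:k]], y, cv=cv).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k], best_acc

# Illustrative synthetic data (not the datasets used in the thesis).
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
selected, acc = select_components(X, y)
print(len(selected), round(acc, 3))
```

Ranking uncorrelated components avoids the multicollinearity problem that makes greedy selection unreliable on the raw features, since each component's squared correlation with the class can be assessed independently.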