| Graduate Student: | 楊超霆 (Chao-Ting Yang) |
| --- | --- |
| Thesis Title: | 基於P值與主成分分析之樸素貝葉斯分類演算法 (A Naive Bayes' Classifier Based on the p-values of Features and Principal Component Analysis) |
| Advisor: | 楊維寧 (Wei-Ning Yang) |
| Oral Defense Committee: | 陳雲岫 (Yun-Shiow Chen), 呂永和 (Yung-Ho Leu) |
| Degree: | Master |
| Department: | College of Management, Department of Information Management |
| Publication Year: | 2022 |
| Graduation Academic Year: | 110 |
| Language: | Chinese |
| Pages: | 13 |
| Chinese Keywords: | 高斯分布 (Gaussian distribution), P值 (p-value), 多重共線性 (multicollinearity), 主成分分析 (principal component analysis), 樸素貝葉斯分類器 (naive Bayes' classifier) |
| English Keywords: | Gaussian distributions, p-value, multicollinearity, PCA, naive Bayes' classifier |
Experiments have conventionally assumed that the data follow a Gaussian distribution, but studies on real datasets show that this assumption is often implausible. This study therefore proposes an algorithm based on the p-value: without making any distributional assumption on the features, it computes, from the training data, the proportion of feature values within a class that are more extreme than the feature value of a testing instance, so that the p-value can classify the data more effectively. The magnitude of the p-value reflects the degree of discrepancy between a testing instance's feature value and a class: a larger p-value indicates that the testing instance is closer to that class and is therefore more likely to be classified into it. To alleviate the problem of multicollinearity among features, principal component analysis (PCA) is first applied to transform all features into mutually uncorrelated principal components; the p-value of each principal component is then computed from the empirical distribution of the training data's principal components. This study combines p-values and PCA with the naive Bayes' classifier, and the experimental results show that the proposed method effectively improves the classification performance of the naive Bayes' classifier.
Empirical studies on real datasets often indicate that the assumption of Gaussian-distributed features may not be plausible. Without distributional assumptions on the features, the proportion of feature values belonging to a specific class in the training data that are more extreme than the feature value of the testing instance, called the p-value, is used to judge the likelihood that the testing instance falls in that class. The magnitude of a p-value reflects the discrepancy between the feature value of the testing instance and the expected feature value in the class. For a specific feature, a large p-value indicates that the testing instance is consistent with the expected instance in the class and is therefore more likely to be classified into that class. To alleviate the problem of multicollinearity among features, principal component analysis (PCA) is first used to transform the original features into uncorrelated principal components. The p-value corresponding to each principal component is then evaluated from the empirical distribution of that principal component in the training data. The proposed method, which combines the p-values of features with PCA, is studied for a naive Bayes' classifier. Empirical experiments show that the proposed method achieves improvements over the Gaussian naive Bayes' classifier.
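The procedure described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual code: the two-sided empirical p-value and the sum-of-log-p-values combination rule are assumptions about how "more extreme" and the naive (independence-based) combination are operationalized, and the class name `PValuePCANB` is invented for this sketch.

```python
import numpy as np

def two_sided_p(train_col, x):
    """Two-sided empirical p-value: the proportion of training values
    for one principal component that are at least as extreme as x
    (one plausible reading of 'more extreme' in the abstract)."""
    left = np.mean(train_col <= x)
    right = np.mean(train_col >= x)
    return min(1.0, 2.0 * min(left, right))

class PValuePCANB:
    """Sketch of a p-value/PCA naive-Bayes-style classifier."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # PCA via SVD of the centered data: transforms the original
        # features into mutually uncorrelated principal components.
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.components_ = Vt
        Z = Xc @ Vt.T  # scores of the training data on the components
        self.class_scores_ = {c: Z[y == c] for c in self.classes_}
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        return self

    def predict(self, X):
        Z = (X - self.mean_) @ self.components_.T
        preds = []
        for z in Z:
            best, best_score = None, -np.inf
            for c in self.classes_:
                train = self.class_scores_[c]
                # Naive combination: sum log p-values over components,
                # clipped to avoid log(0) for values outside the class range.
                logp = sum(
                    np.log(max(two_sided_p(train[:, j], z[j]), 1e-12))
                    for j in range(train.shape[1])
                )
                score = np.log(self.priors_[c]) + logp
                if score > best_score:
                    best, best_score = c, score
            preds.append(best)
        return np.array(preds)
```

A testing instance is assigned to the class whose empirical p-values (one per principal component) are jointly largest, mirroring how a Gaussian naive Bayes' classifier multiplies per-feature likelihoods.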