Graduate student: 黃柏崴 (Bo-Wei Huang)
Thesis title: 植基於屬性p值之樸素貝氏分類器 (A Naive Bayes Classifier Based on the p-values of Features)
Advisor: 楊維寧 (Wei-Ning Yang)
Committee members: 呂永和 (Yung-Ho Leu), 陳雲岫 (Yun-Shiow Chen)
Degree: Master
Department: Department of Information Management, School of Management
Year of publication: 2022
Academic year of graduation: 110 (2021–2022)
Language: Chinese
Pages: 26
Keywords (Chinese): 高斯樸素貝氏分類器, 閾值α, 高斯分佈, p值, 交叉驗證
Keywords (English): Gaussian naive Bayes classifier, threshold α, Gaussian distribution, p-value, cross-validation
Abstract: Gaussian naive Bayes classifiers evaluate the likelihood of each feature for every class under the assumption that each feature follows a Gaussian distribution. Empirical studies on real datasets, however, indicate that this Gaussian assumption may not be plausible. This thesis proposes a naive Bayes classifier in which the likelihood corresponding to each feature is replaced by a p-value. For a specific class, the p-value of a feature value in a testing instance is the proportion of training feature values more extreme than the observed value, computed from the empirical distribution of the corresponding feature. The magnitude of the p-value reflects the discrepancy between the observed feature value and the value expected for that class: for a specific feature, a testing instance with a large p-value is consistent with the class expectation and is therefore more likely to be classified into that class. The proposed classifier replaces the product of likelihoods in the conventional naive Bayes classifier with the product of p-values. To alleviate the low true positive rate of the Gaussian naive Bayes classifier on imbalanced datasets, a 10-fold cross-validation scheme is used to determine the classification threshold α that maximizes the F1 measure. Experimental results show that the proposed classifier achieves substantial improvements over the Gaussian naive Bayes classifier.
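The classification rule described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the class and function names are invented for this sketch, and "more extreme" is read here as a two-sided deviation from the class median, which is one plausible interpretation of the empirical p-value.

```python
# Sketch of a naive Bayes-style classifier that scores each class by the
# product of per-feature empirical p-values instead of Gaussian likelihoods.
# Names (PValueNaiveBayes, empirical_p_value) are illustrative assumptions.
import numpy as np

def empirical_p_value(train_values, x):
    """Proportion of training feature values at least as extreme as x,
    measured as absolute deviation from the class median (assumed reading)."""
    train_values = np.asarray(train_values, dtype=float)
    center = np.median(train_values)
    deviation = abs(x - center)
    return np.mean(np.abs(train_values - center) >= deviation)

class PValueNaiveBayes:
    """Binary classifier: score each class by the product of per-feature
    empirical p-values; a threshold alpha (tuned elsewhere, e.g. by 10-fold
    cross-validation maximizing F1) trades off the two class scores."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)          # assumes two classes, sorted
        self.values_ = {c: X[y == c] for c in self.classes_}
        return self

    def score_products(self, x):
        # Product of p-values across features, computed per class.
        return {c: np.prod([empirical_p_value(V[:, j], x[j])
                            for j in range(V.shape[1])])
                for c, V in self.values_.items()}

    def predict(self, x, alpha=1.0):
        # Assign the positive class when its p-value product exceeds
        # alpha times the negative class's product.
        scores = self.score_products(np.asarray(x, dtype=float))
        neg, pos = self.classes_[0], self.classes_[1]
        return pos if scores[pos] >= alpha * scores[neg] else neg
```

In this sketch, α = 1 reduces to picking the class with the larger p-value product; on an imbalanced dataset one would sweep α over a grid inside each cross-validation fold and keep the value that maximizes the F1 measure, as the abstract describes.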