Author: 黃柏崴 (Bo-Wei Huang)
Thesis title: 植基於屬性p值之樸素貝氏分類器 (A Naive Bayes’ Classifier Based on The p-values of Features)
Advisor: 楊維寧 (Wei-Ning Yang)
Committee members: 呂永和 (Yung-Ho Leu), 陳雲岫 (Yun-Shiow Chen)
Degree: Master
Department: College of Management, Department of Information Management
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar, 2021-2022)
Language: Chinese
Pages: 26
Keywords: Gaussian naive Bayes classifier, threshold α, Gaussian distribution, p-value, cross-validation

Gaussian naive Bayes classifiers evaluate the likelihood of each feature for each class under the assumption that the feature follows a Gaussian distribution. Empirical studies on real datasets often indicate that this assumption is not plausible. This research proposes a naive Bayes classifier in which the likelihood corresponding to each feature is replaced by a p-value. For a specific class, the p-value of a testing instance's feature value is the proportion of training feature values in that class that are more extreme than the testing instance's value, computed from the empirical distribution of the corresponding feature. The size of the p-value reflects the discrepancy between the feature value of the testing instance and the feature value expected in the class. For a specific feature, a large p-value indicates that the testing instance is consistent with what is expected in the class, so the instance is more likely to be classified into that class. The proposed classifier replaces the product of likelihoods in the conventional naive Bayes classifier with the product of p-values. To alleviate the low true positive rate that the Gaussian naive Bayes classifier exhibits on imbalanced datasets, 10-fold cross-validation is used to determine the classification threshold α that maximizes the F1 measure. Empirical results show that the proposed classifier achieves substantial improvements over the Gaussian naive Bayes classifier.
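The procedure outlined in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the thesis's implementation. The abstract does not say how "more extreme" is measured, how zero p-values are handled, or how the threshold is applied, so the following choices are assumptions: a two-sided tail proportion (doubled and capped at 1), add-one smoothing of the tail count, a normalized positive-class score compared against α, interleaved folds, and F1 computed on pooled out-of-fold predictions. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from math import prod


def empirical_p_value(sorted_values, x):
    """Empirical p-value of x against one class's sorted training values.

    Assumption: two-sided tail proportion, doubled and capped at 1, with
    add-one smoothing so the product of p-values never collapses to zero.
    """
    n = len(sorted_values)
    at_or_below = bisect_right(sorted_values, x)     # values <= x
    at_or_above = n - bisect_left(sorted_values, x)  # values >= x
    tail = min(at_or_below, at_or_above)
    return min(1.0, 2.0 * (tail + 1) / (n + 1))


def fit(X, y):
    """Store each class's prior and the sorted training values per feature."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = [sorted(col) for col in zip(*rows)]
        model[label] = (len(rows) / len(X), columns)
    return model


def positive_score(model, x, positive=1):
    """Prior times product of p-values for the positive class, normalized."""
    scores = {
        label: prior * prod(empirical_p_value(c, v) for c, v in zip(cols, x))
        for label, (prior, cols) in model.items()
    }
    return scores.get(positive, 0.0) / sum(scores.values())


def f1_score(is_positive, predicted_positive):
    """F1 measure from two parallel lists of booleans."""
    pairs = list(zip(is_positive, predicted_positive))
    tp = sum(a and b for a, b in pairs)
    fp = sum((not a) and b for a, b in pairs)
    fn = sum(a and (not b) for a, b in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def choose_alpha(X, y, alphas, k=10, positive=1):
    """Return the candidate threshold with the highest out-of-fold F1."""
    scores = [0.0] * len(X)
    for start in range(k):
        held_out = set(range(start, len(X), k))  # interleaved fold
        model = fit(
            [x for i, x in enumerate(X) if i not in held_out],
            [t for i, t in enumerate(y) if i not in held_out],
        )
        for i in held_out:
            scores[i] = positive_score(model, X[i], positive)
    truth = [t == positive for t in y]
    return max(alphas, key=lambda a: f1_score(truth, [s >= a for s in scores]))
```

An instance `x` would then be assigned to the positive class when `positive_score(fit(X, y), x) >= alpha`, with `alpha` taken from `choose_alpha`. Lowering α below 0.5 trades precision for recall, which is one way a threshold can raise the true positive rate on an imbalanced dataset.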

Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2 Datasets and Research Methods
  2.1 Dataset Overview
    2.1.1 Diabetic Retinopathy Detection Datasets
    2.1.2 Australian Credit Approval Datasets
  2.2 Statistical Estimation
    2.2.1 Empirical Distribution
    2.2.2 P-value
  2.3 Cross-Validation
    2.3.1 K Fold Cross Validation
  2.4 Classification Model Algorithms
    2.4.1 Bayes Theorem
    2.4.2 Naïve Bayes Classifier
    2.4.3 Gaussian Naïve Bayes Classifier
    2.4.4 P-values Applied to Classification
  2.5 Laplace Smoothing
  2.6 Threshold α
Chapter 3 Experimental Procedure and Results
  3.1 Experimental Procedure
    3.1.1 Method 1: Gaussian classification combining K-fold cross-validation with threshold α
    3.1.2 Method 2: p-value classification combining K-fold cross-validation with threshold α
  3.2 Experimental Results
    3.2.1 Diabetic Retinopathy Detection Dataset results
    3.2.2 Australian Credit Approval Datasets results
Chapter 4 Conclusions and Future Work
  4.1 Conclusions
  4.2 Future Work
References


Full-text release date: 2024/07/12 (campus network, off-campus network, and the National Central Library's Taiwan theses and dissertations system)