
Student: Guan-Cheng Chen (陳冠丞)
Title: A Bayes' Classifier Based on the Mahalanobis Distance from the Class Center (植基於離類別中心馬氏距離之貝氏分類器)
Advisor: Wei-Ning Yang (楊維寧)
Committee Members: Yung-Ho Leu (呂永和), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: College of Management - Department of Information Management
Publication Year: 2022
Graduation Academic Year: 110 (2021-2022)
Language: Chinese
Pages: 28
Keywords: Gaussian Naïve Bayes classifier, Mahalanobis distance, imbalanced datasets, cross validation, overfitting, threshold, p-value, F1-score

The Gaussian Naïve Bayes classifier evaluates the likelihood of each feature under the assumption that every feature follows a Gaussian distribution; ignoring the correlations among features, it takes the product of the per-feature likelihoods as the likelihood used in the classifier. This study proposes a Bayes' classifier based on the Mahalanobis distance from the class center, which accounts for the correlations among features that follow a multivariate Gaussian distribution. For a given class, the likelihood of a testing instance in the Bayes' classifier is replaced by its p-value, defined as the proportion of training instances that lie farther from the class center than the testing instance; a larger p-value means the testing instance is closer to that class center and is therefore more likely to be classified into that class. To mitigate the low true positive rate (TPR) caused by imbalanced datasets, K-fold cross validation is used to find a threshold (α) that maximizes the classifier's F1-score. Experimental results show that the proposed classifier achieves substantial improvements over the Gaussian Naïve Bayes classifier.
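The p-value classifier described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the function names (`fit_class_stats`, `p_value`, `classify`) are my own, and the p-value is computed empirically as the fraction of training instances whose Mahalanobis distance from the class center exceeds that of the query point.

```python
import numpy as np

def fit_class_stats(X):
    """Class center (mean vector) and inverse covariance from training data."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return mu, cov_inv

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance of x from the class center mu."""
    d = x - mu
    return float(d @ cov_inv @ d)

def p_value(x, X_train, mu, cov_inv):
    """Proportion of training instances farther from the class center than x."""
    d_x = mahalanobis_sq(x, mu, cov_inv)
    return float(np.mean([mahalanobis_sq(xi, mu, cov_inv) > d_x for xi in X_train]))

def classify(x, class_data, priors):
    """Bayes' rule with the p-value in place of the likelihood: argmax prior * p-value."""
    scores = {}
    for c, Xc in class_data.items():
        mu, cov_inv = fit_class_stats(Xc)
        scores[c] = priors[c] * p_value(x, Xc, mu, cov_inv)
    return max(scores, key=scores.get)
```

A testing instance near a class center gets a large p-value for that class, so the argmax naturally favors the closest center weighted by the class prior.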


The Gaussian Naïve Bayes classifier evaluates the likelihood of each individual feature based on the assumption that every feature follows a Gaussian distribution. Ignoring the correlations among the features, it uses the product of the per-feature likelihoods as the likelihood of the Gaussian Naïve Bayes classifier. To account for the correlations among features following a multivariate Gaussian distribution, this research proposes a Bayes' classifier based on the Mahalanobis distance from the class center. For a specific class, the likelihood corresponding to a testing instance in the Bayes' classifier is replaced by a p-value, defined as the proportion of training instances located farther from the class center than the testing instance. A testing instance with a large p-value is close to the class center and is therefore more likely to be classified into that class. To alleviate the low true positive rate on imbalanced datasets, a K-fold cross validation scheme is used to determine the threshold for the Bayes' classifier that achieves the maximum F1-score. Empirical results show that the proposed classifier achieves substantial improvements over the Gaussian Naïve Bayes classifier.
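The threshold-selection step can be sketched as below. This is an illustrative sketch under my own assumptions, and the thesis's exact procedure may differ: here an instance is predicted positive when its score reaches α, the F1-maximizing α is found per validation fold, and the per-fold thresholds are averaged. The names (`best_threshold`, `cv_threshold`) and the `score_fn` interface are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def best_threshold(scores, y):
    """Threshold alpha maximizing F1 when predicting positive for score >= alpha."""
    grid = np.unique(scores)  # candidate thresholds: the observed score values
    f1s = [f1_score(y, (scores >= a).astype(int)) for a in grid]
    return float(grid[int(np.argmax(f1s))])

def cv_threshold(score_fn, X, y, k=5):
    """Average the per-fold F1-maximizing thresholds over K stratified folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    alphas = []
    for train_idx, val_idx in skf.split(X, y):
        # score_fn fits on the training fold and scores the validation fold
        val_scores = score_fn(X[train_idx], y[train_idx], X[val_idx])
        alphas.append(best_threshold(val_scores, y[val_idx]))
    return float(np.mean(alphas))
```

On an imbalanced dataset, maximizing F1 rather than accuracy keeps the classifier from trivially predicting the majority class, which is the motivation for tuning α at all.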

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Research Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2: Datasets and Research Methods
  2.1 Datasets
    2.1.1 Simulated Multivariate Gaussian Distribution Data
    2.1.2 Banknote Authentication (Bank)
    2.1.3 Diabetic Retinopathy Debrecen (DRD)
    2.1.4 Mammography (Breast Cancer)
  2.2 Data Processing Methods
    2.2.1 Mahalanobis Distance
  2.3 Classification Algorithms
    2.3.1 Bayes' Theorem
    2.3.2 Naïve Bayes Classifier
    2.3.3 p-value Classifier
    2.3.4 Threshold (α) Classification
  2.4 Model Evaluation Methods
    2.4.1 Cross Validation
  2.5 Classification Metrics
    2.5.1 F1-score
Chapter 3: Experimental Procedure and Results
  3.1 Experimental Procedure
    3.1.1 Method 1: Gaussian Naïve Bayes with Threshold α via K-fold Cross Validation
    3.1.2 Method 2: Mahalanobis-Distance p-value with Threshold α via K-fold Cross Validation
  3.2 Experimental Results
    3.2.2 Banknote Authentication (Bank) Results
    3.2.3 Diabetic Retinopathy Debrecen (DRD) Results
    3.2.4 Mammography (Breast Cancer) Results
Chapter 4: Conclusions and Future Work
  4.1 Conclusions
  4.2 Future Work
References


Full-text release date: 2024/07/13 (campus network)
Full-text release date: 2024/07/13 (off-campus network)
Full-text release date: 2024/07/13 (National Central Library: Taiwan NDLTD system)