Graduate Student: 朱芷萱 Chih-Hsuan Chu
Thesis Title: 簡單貝氏分類器結合p-值之研究 (Naive Bayesian Classifier Based on p-Values)
Advisor: 楊維寧 Wei-Ning Yang
Committee Members: 陳雲岫 Yun-Shiow Chen, 呂永和 Yung-Ho Leu
Degree: Master
Department: Department of Information Management, College of Management
Year of Publication: 2016
Academic Year of Graduation: 104 (2015–2016)
Language: Chinese
Pages: 30
Keywords (Chinese): 簡單貝氏分類器, 主成份分析, p-值
Keywords (English): Naive Bayesian Classifier, Principal Component Analysis, p-value
This study applies the naive Bayesian classifier, combined with principal component analysis and the p-value from statistical inference, to binary classification, and improves classification accuracy by screening the attribute variables for the principal relevant factors. A Bayesian classifier assigns an object to its most probable class according to the object's attribute vector. The probabilities used to assign the object to the classes are called posterior probabilities. A posterior probability revises the prior probability that the object belongs to each class once the attribute vector has been observed, using the likelihood of that attribute vector within each class. Thus the posterior probability that an object with a given attribute vector belongs to a class is proportional to the product of the class's prior probability and the likelihood of the attribute vector within that class. This study applies principal component analysis to remove correlation among the attribute variables, so as to satisfy the naive Bayesian classifier's assumption that attributes are independent. In hypothesis testing, the p-value reflects the size of the discrepancy between what is actually observed and what would be expected if the hypothesis were true; the smaller the p-value, the larger the discrepancy. This study replaces the likelihood of the attribute vector within each class by the corresponding p-value, and evaluates the proposed method through statistical experiments on a breast cancer data set.
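The "posterior ∝ prior × likelihood" rule described above can be sketched as follows. This is a minimal illustration assuming Gaussian class-conditional features; the function name and the synthetic-data setup are illustrative, not from the thesis.

```python
import numpy as np

def gaussian_naive_bayes_posterior(x, class_data, priors):
    """Posterior over classes for one instance x: the class prior times the
    product of per-feature Gaussian likelihoods (naive independence)."""
    scores = {}
    for c, data in class_data.items():
        mu = data.mean(axis=0)
        sd = data.std(axis=0, ddof=1)
        # joint likelihood = product of per-feature normal densities
        lik = np.prod(np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi)))
        scores[c] = priors[c] * lik
    total = sum(scores.values())
    # normalize so the posteriors sum to 1
    return {c: s / total for c, s in scores.items()}
```

An instance lying near one class's training mean receives a posterior close to 1 for that class, since the likelihood term dominates the product.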
The naive Bayesian classifier estimates the joint likelihood of a testing instance as the product of the per-feature likelihoods estimated from the training data, and then applies Bayes' rule to obtain the posterior distribution over classes. The p-value in statistical hypothesis testing, which reflects the discrepancy between the observed sample and the sample expected under a hypothesis, serves a similar purpose and is used to replace the likelihood in the proposed Bayesian classifier. The naive independence assumption among features within each class is alleviated by applying principal component analysis to obtain uncorrelated transformed features. The joint p-value, the product of the p-values associated with each transformed feature estimated from the training data, is combined with the prior distribution to calculate a posterior p-value for the testing instance. Empirical results demonstrate a substantial improvement in classification accuracy over existing classification methods.
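The pipeline described above (PCA decorrelation, per-feature p-values, product with the prior) could be sketched as below. This is a sketch under stated assumptions: the thesis does not spell out the exact p-value construction here, so a two-sided p-value under a Gaussian class-conditional model is assumed, and all function names are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def fit_pca(X):
    """Principal components of the pooled training data (decorrelates features)."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    return mu, vecs

def two_sided_p(x, mu, sd):
    """Two-sided p-value of observing x under N(mu, sd^2): 2 * P(Z >= |z|)."""
    z = (x - mu) / sd
    return erfc(abs(z) / sqrt(2.0))

def pvalue_bayes_classify(x, X_by_class, priors):
    """Score each class by prior times the joint p-value of the
    PCA-transformed test instance under that class's training data."""
    center, W = fit_pca(np.vstack(list(X_by_class.values())))
    z = (x - center) @ W                      # transformed test instance
    scores = {}
    for c, Xc in X_by_class.items():
        Zc = (Xc - center) @ W                # transformed training data, class c
        mu = Zc.mean(axis=0)
        sd = Zc.std(axis=0, ddof=1)
        # joint p-value = product of per-component p-values
        joint_p = np.prod([two_sided_p(z[j], mu[j], sd[j]) for j in range(len(z))])
        scores[c] = priors[c] * joint_p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```

As with the likelihood, a large joint p-value indicates the transformed instance is consistent with a class's training distribution, so the class whose normalized score is highest is selected.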