
Graduate student: Shu-ming Hsu
Thesis title: A Reverse Nearest Neighbor Based Instance Selection Algorithm
Advisor: Bi-ru Dai
Committee members: Yuh-jye Lee, Kuo-wei Hsu, Wen-chih Peng
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: English
Number of pages: 30
Keywords: Data reduction, classification, instance selection, nearest neighbor, reverse nearest neighbor
  • Data reduction aims to select a subset from a dataset; its benefits are reduced storage requirements and improved classification efficiency. Using the subset as training data can not only maintain classification accuracy but even improve it further when noise is removed, so the key is how to select representative samples while avoiding such noise. Many instance selection methods are designed around nearest neighbors, and they can be divided into incremental and decremental types according to their sample selection strategies. Incremental methods first select some instances as samples and then iteratively add, one by one, instances whose nearest sample belongs to a different class to the sample set. Decremental methods gradually remove instances according to their strategies, and the instances that remain become the samples. We propose a method based on reverse nearest neighbors, called Reverse Nearest Neighbor Reduction (RNNR). RNNR selects as samples those instances that can represent other instances of the same class; moreover, RNNR does not need to spend much time iteratively scanning a dataset. Experimental results show that RNNR generally achieves higher accuracy, selects fewer samples, and takes less processing time than its comparators.


    Data reduction extracts a subset from a dataset; its advantage is a smaller storage requirement. Using the subset as training data can maintain classification accuracy, and accuracy can sometimes be further improved because noise is eliminated. The key is how to choose representative samples while ignoring noise at the same time. Many instance selection algorithms are based on the Nearest Neighbor decision rule (NN), and they select samples under one of two strategies: incremental or decremental. Incremental algorithms select some instances as samples and iteratively add to the sample set those instances whose nearest sample has a different class label. Decremental algorithms gradually remove instances according to their own strategies. We propose an algorithm based on the Reverse Nearest Neighbor (RNN), called Reverse Nearest Neighbor Reduction (RNNR). RNNR selects samples that can represent other instances of the same class. In addition, RNNR does not need to iteratively scan the dataset, which takes much processing time. Experimental results show that RNNR generally achieves higher accuracy, selects fewer samples, and takes less processing time than its comparators.
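    The selection idea described in the abstract can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a simplified keep rule (retain any instance that is a same-class nearest neighbor of at least one other instance, i.e., has a same-class reverse nearest neighbor); it stands in for, and does not reproduce, RNNR's full noise-removal, candidate-selection, and sample-reduction phases.

    ```python
    import math

    def rnn_instance_selection(X, y, k=1):
        """Sketch of reverse-nearest-neighbor-based instance selection.

        Keeps every instance that appears among the k nearest neighbors
        of at least one other instance of the same class. This is an
        illustrative simplification of RNNR, not the thesis's exact
        three-phase procedure.
        """
        n = len(X)
        # k nearest neighbors of each instance (excluding itself);
        # sorted() is stable, so distance ties break by lower index
        nn = []
        for i in range(n):
            order = sorted((j for j in range(n) if j != i),
                           key=lambda j: math.dist(X[i], X[j]))
            nn.append(order[:k])
        # count same-class reverse nearest neighbors of each instance
        score = [0] * n
        for i in range(n):
            for j in nn[i]:
                if y[i] == y[j]:
                    score[j] += 1
        # samples: instances that represent at least one same-class neighbor
        return [j for j in range(n) if score[j] > 0]
    ```

    On a toy two-cluster dataset, points that several same-class neighbors point back to accumulate reverse nearest neighbors and are kept, while points no same-class instance points to are dropped; no iterative rescanning of the dataset is needed.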

    Abstract
    Chinese Abstract
    List of Tables
    List of Figures
    List of Contents
    1 Introduction
      1.1 Motivation
      1.2 Background
      1.3 Contribution
      1.4 The Organization of this Thesis
    2 Related Works
      2.1 Incremental Algorithms
      2.2 Decremental Algorithms
    3 The Reverse Nearest Neighbor Reduction Algorithm
      3.1 Noise Removal
      3.2 Reverse k Nearest Neighbor and Candidate Sample Selection
      3.3 The Sample Reduction Process
    4 Experiments
      4.1 Experiments on Noise Removal
      4.2 Experiments on Absorption Strategy
      4.3 Experiments on Sample Reduction
      4.4 Performance Comparison with Other Algorithms
    5 Conclusions and Future Works
    References

    1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd ed. (2000)
    2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd ed. (2006)
    3. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining (2006)
    4. Keller, J.M., Gray, M.R., Givens, J.A. Jr.: A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, vol. 15, no. 4, pp. 580-585 (1985)
    5. Seidl, T., Kriegel, H.P.: Optimal multi-step k-nearest neighbor search. In Proc. of 1998 ACM SIGMOD International Conference on Management of Data, pp. 154-165 (1998)
    6. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167 (1998)
    7. Quinlan, J.R.: Induction of decision trees. Machine Learning, vol. 1, no.1, pp. 81-106 (1986)
    8. Knaus, W.A., Draper, E.A., Wagner, D.P., Zimmerman, J.E.: APACHE II: a severity of disease classification system. Critical Care Medicine, vol. 13, no. 10, pp. 818-829 (1985)
    9. Joachims, T.: Transductive inference for text classification using support vector machines. In Proc. of 16th International Conference on Machine Learning, pp. 200-209 (1999)
    10. Joachims, T.: Optimizing search engines using clickthrough data. In Proc. of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133-142 (2002)

    11. Dumais, S., Chen, H.: Hierarchical classification of Web content. In Proc. of 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 256-263 (2000)
    12. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics, vol. 3, issue 6, pp. 610-621 (1973)
    13. Bevington, P.R., Robinson, D.K.: Data Reduction and Error Analysis for the Physical Sciences, 3rd ed. (2002)
    14. Tonry, J., Davis, M.: A survey of galaxy redshifts. I - Data reduction techniques. Astronomical Journal, vol. 84, pp. 1511-1525 (1979)
    15. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 153-172 (2002)
    16. Cano, J.R., Herrera, F., Lozano, M.: Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Transactions on Evolutionary Computation, vol. 7, issue 6, pp. 561-575 (2003)
    17. Hart, P.: The condensed nearest neighbor rule. IEEE Transactions on Information Theory, vol. 14, pp. 515-516 (1968)
    18. Devi, F.S., Murty, M.N.: An incremental prototype set building technique. Pattern Recognition, vol. 35, issue 2, pp. 505-513 (2002)
    19. Angiulli, F.: Fast nearest neighbor condensation for large data sets classification. IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1450-1464 (2007)
    20. Chou, C.H., Kuo, B.H., Chang, F.: The generalized condensed nearest neighbor rule as a data reduction method. In Proc. of 18th International Conference on Pattern Recognition, pp. 556–559 (2006)
    21. Brighton, H., Mellish, C.: On the consistency of information filters for lazy learning algorithms. In Proc. of 3rd European Conference on Principles of Data Mining and Knowledge Discovery, pp. 283-288 (1999)
    22. Marchiori, E.: Hit miss networks with applications to instance selection. The Journal of Machine Learning Research, vol. 9, pp. 997-1017 (2008)
    23. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning, vol. 38, no. 3, pp. 257-286 (2000)
    24. Delany, S.J.: The Good, the bad and the incorrectly classified: Profiling cases for case-base editing. In Proc. of 8th International Conference of Case-Based Reasoning Research and Development, vol. 5650, pp. 135-149 (2009)
    25. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory, vol. 13, issue 1, pp. 21-27 (1967)
    26. Korn, F., Muthukrishnan, S.: Influence sets based on reverse nearest neighbor queries. In Proc. of 2000 ACM SIGMOD International Conference on Management of Data, pp. 201–212 (2000)
    27. Yang, C., Lin, K.I.: An index structure for efficient reverse nearest neighbor queries. In Proc. of 17th International Conference on Data Engineering, pp. 485-492 (2001)
    28. Achtert, E., Bohm, C., Kroger, P., Kunath, P., Renz, M., Pryakhin, A.: Efficient reverse k-nearest neighbor search in metric spaces. In Proc. of 2006 ACM SIGMOD International Conference on Management of Data, pp. 515-526 (2006)
    29. Tao, Y., Yiu, M.L., Mamoulis, N.: Reverse nearest neighbor search in metric spaces. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1239-1252 (2006)
    30. Aha, D., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning, vol. 6, no. 1, pp. 37-66 (1991)
    31. Dunn, J.C.: Some recent investigations of a new fuzzy partition algorithm and its application to pattern classification problems. Journal of Cybernetics, vol. 4, issue 2, pp. 1-15 (1974)
    32. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, vol. 2, issue 3, pp. 408–421 (1972)
    33. Smyth, B., Keane, M.: Remembering to forget: A competence preserving case deletion policy for CBR systems. In Proc. of 14th International Joint Conference on Artificial Intelligence, pp. 377–382 (1995)
    34. Delany, S.J., Cunningham, P.: An analysis of case-based editing in a spam filtering system. In Proc. of 7th European Conference on Case-Based Reasoning, vol. 3155, pp. 128–141 (2004)
    35. University of California, Irvine (UCI) Machine Learning (ML) Repository, http://archive.ics.uci.edu/ml/
    36. University of Waikato, Weka Machine Learning Project, http://www.cs.waikato.ac.nz/ml/weka/

    Full-text release date: 2016/07/27 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan thesis and dissertation system)