Graduate Student: 黃建程 Chien-Cheng Huang
Thesis Title: 隨機森林應用於不平衡資料集之研究 (A Study of Imbalanced Data Classification Problem Using Random Forest)
Advisor: 陳維美 Wei-Mei Chen
Committee Members: 陳維美 Wei-Mei Chen, 陳永耀 Yung-Yao Chen, 吳晉賢 Chin-Hsien Wu, 阮聖彰 Shanq-Jang Ruan
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Academic Year of Graduation: 111
Language: Chinese
Number of Pages: 46
Keywords (Chinese): 隨機森林, 分類器, 不平衡資料集, 欠採樣
Keywords (English): Random Forest, classifier, imbalanced data, undersampling
Random Forest is one of the most widely used classification algorithms; its efficiency and interpretability have made it popular in domains such as medicine, finance, item classification, anomaly detection, and box-office prediction. However, most classifiers perform poorly on imbalanced datasets, so methods such as cost-sensitive learning, boosting, oversampling, and undersampling have been proposed to address this problem. This thesis analyzes data similarity and data distribution to determine the importance of each majority-class sample and then performs a principled undersampling. The experimental results show that, when our proposed method is applied to Random Forest on imbalanced datasets, it significantly improves accuracy on the minority class without unduly sacrificing accuracy on the majority class.
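As a rough illustration of the pipeline the abstract describes (ranking majority-class samples by an importance score derived from data similarity, then undersampling before training a classifier), the following is a minimal sketch. The importance heuristic used here — distance to the nearest minority-class sample, keeping the boundary-adjacent majority points — is an illustrative assumption, not the thesis's actual criterion:

```python
import math
import random

def undersample_majority(majority, minority, keep_ratio=0.5):
    """Rank majority-class samples by a hypothetical importance score
    (distance to the nearest minority-class sample) and keep only the
    top fraction. Samples closest to the minority region are treated
    as boundary-defining; this heuristic is an assumption for
    illustration, not the method proposed in the thesis."""
    scored = []
    for m in majority:
        nearest = min(math.dist(m, x) for x in minority)
        scored.append((nearest, m))
    scored.sort(key=lambda t: t[0])  # closest to the minority class first
    k = max(1, int(len(majority) * keep_ratio))
    return [m for _, m in scored[:k]]

# Synthetic 2-D data: a small minority cluster and a large majority cluster.
random.seed(0)
minority = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10)]
majority = [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(100)]

# Retain only 20% of the majority class before training the classifier.
balanced_majority = undersample_majority(majority, minority, keep_ratio=0.2)
print(len(balanced_majority))  # 20 majority samples retained
```

The reduced majority set would then be combined with the full minority set to train the Random Forest, so the ensemble sees a far less skewed class distribution.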