
Graduate Student: 黃建程 (Chien-Cheng Huang)
Thesis Title: 隨機森林應用於不平衡資料集之研究 (A Study of Imbalanced Data Classification Problem Using Random Forest)
Advisor: 陳維美 (Wei-Mei Chen)
Committee Members: 陳維美 (Wei-Mei Chen), 陳永耀 (Yung-Yao Chen), 吳晉賢 (Chin-Hsien Wu), 阮聖彰 (Shanq-Jang Ruan)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 46
Chinese Keywords: 隨機森林 (Random Forest), 分類器 (classifier), 不平衡資料集 (imbalanced dataset), 欠採樣 (undersampling)
Foreign Keywords: Random Forest, classifier, imbalanced data, undersampling
Views: 222; Downloads: 0

Abstract (Chinese): Random Forest is one of the most widely used classification algorithms, valued for its performance and interpretability, with broad applications in medicine, finance, item classification, anomaly detection, box-office prediction, and more. However, most classifiers perform very poorly on imbalanced datasets, so methods such as cost-sensitive learning, boosting, oversampling, and undersampling have been proposed to address this problem. This thesis analyzes data similarity and data distribution to determine the importance of each majority-class sample and then performs a principled undersampling. Experimental results show that, when applied to Random Forest on imbalanced datasets, the proposed method significantly improves accuracy on the minority class without excessively sacrificing accuracy on the majority class.


Abstract (English): Random Forest is one of the most widely used classification algorithms; it is favored for its efficiency and interpretability and has wide applications in medicine, finance, item classification, anomaly detection, box-office prediction, and more. However, most classifiers are ineffective on imbalanced datasets, so methods such as cost-sensitive learning, boosting, oversampling, and undersampling have been proposed to solve this problem. This thesis analyzes data similarity and data distribution to determine the influence of each majority-class sample and then performs undersampling. The experimental results show that applying our proposed method to Random Forest on imbalanced datasets significantly improves the accuracy on the minority class without losing accuracy on the majority class.
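
The abstract describes the method only at a high level. As a rough illustration of the general idea, the Python sketch below scores each majority-class sample by its local sparsity (a stand-in for the thesis's similarity-and-distribution analysis, whose exact formulation is not given on this page), keeps only the higher-scoring majority samples, and then trains a standard Random Forest. The function undersample_majority, the k-nearest-neighbour scoring, and the keep_ratio parameter are illustrative assumptions, not the author's actual algorithm.

```python
# Minimal sketch (not the thesis's algorithm): importance-scored undersampling
# of the majority class followed by a standard Random Forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors


def undersample_majority(X, y, majority_label=0, keep_ratio=0.5, k=5):
    """Score each majority sample by its mean distance to its k nearest
    majority-class neighbours (a rough local-sparsity proxy) and keep only
    the highest-scoring fraction, i.e. drop samples sitting in dense,
    redundant regions. Purely illustrative."""
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    X_maj = X[maj_idx]

    # k+1 neighbours because each point's nearest neighbour is itself.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X_maj).kneighbors(X_maj)
    sparsity = dist[:, 1:].mean(axis=1)

    # Keep at least as many majority samples as there are minority samples.
    n_keep = max(int(keep_ratio * len(maj_idx)), len(min_idx))
    keep = maj_idx[np.argsort(sparsity)[-n_keep:]]

    sel = np.concatenate([keep, min_idx])
    return X[sel], y[sel]


# Toy imbalanced dataset: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_bal, y_bal = undersample_majority(X, y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)
print(clf.score(X, y))
```

Judging from the table of contents below (Sections 3.2.2 and 3.2.3), the thesis itself iterates classifier training and builds a final ensemble rather than fitting a single forest once, so this sketch covers only the resampling step.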

Table of Contents:
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Thesis Organization
2 Literature Review
2.1 Research Related to Classifiers
2.1.1 Decision Trees
2.1.2 Ensemble Learning
2.1.3 Random Forest
2.2 The Class Imbalance Problem
2.2.1 Data-level methods
2.2.2 Algorithm-level methods
3 Research Method
3.1 Problem Definition
3.2 Algorithm Overview
3.2.1 Computing Dataset Characteristics
3.2.2 Iteratively Training Classifiers
3.2.3 Training the Final Ensemble Model
3.2.4 Algorithm Flow
4 Experimental Results and Analysis
4.1 Evaluation Metrics
4.2 Experimental Environment
4.3 Experimental Data and Settings
4.4 Analysis of Experimental Results
4.4.1 Synthetic Datasets
4.4.2 MACC Performance Comparison
4.4.3 ACC Performance Comparison
4.4.4 Precision Performance Comparison
4.4.5 Recall Performance Comparison
4.4.6 F1-score Performance Comparison
4.4.7 Total Execution Time Comparison
5 Conclusion
References
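
Chapter 4 compares methods by MACC, ACC, Precision, Recall, F1-score, and total execution time. MACC is not defined on this record page, but the other four are the standard binary-classification metrics; a minimal reference computation with scikit-learn, using made-up labels purely for illustration:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical test labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

print("ACC      :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```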


Full-text availability: 2025/08/15 (campus network); 2025/08/15 (off-campus network); 2025/08/15 (National Central Library: Taiwan NDLTD system).