
Author: 陳彥汝 (Yan-Ru Chen)
Thesis title: 於不平衡網路借貸資料集中使用機器學習機制預測違約風險
Using Machine Learning Schemes to Predict Default Risk on Imbalanced Peer-to-Peer Lending Dataset
Advisor: 呂政修 (Jenq-Shiou Leu)
Oral defense committee: 阮聖彰 (Shanq-Jang Ruan), 鄭瑞光 (Ray-Guang Cheng), 袁錦鋒 (Kam-Fung Yuen)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of publication: 2017
Academic year of graduation: 105
Language: Chinese
Number of pages: 45
Chinese keywords: 網路借貸, 機器學習, 不平衡資料集
Foreign-language keywords: Peer-to-Peer Lending, Machine Learning, Imbalanced datasets
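  • With the rise of financial technology (FinTech), Peer-to-Peer (P2P) lending has flourished. Unlike traditional lending, P2P lending creates a new kind of business value: for individuals and small companies that have difficulty obtaining bank loans, it is a convenient channel that opens up more possibilities. P2P lending is, however, a high-risk business, so how to assess the risk of a loan is a question worth investigating. P2P lending data is clearly an imbalanced dataset. Most real-world data is imbalanced; for example, a factory produces far fewer defective items than good ones, and fraudulent credit card transactions are far rarer than legitimate ones. In this study we try to predict loan default risk with machine learning, but an imbalanced dataset severely degrades the performance of the resulting models. We therefore first process the imbalanced dataset with re-sampling and cost-sensitive learning, and then make predictions with machine learning methods. In this thesis, we use several machine learning algorithms to model and predict the default risk of P2P loans, and use re-sampling and cost-sensitive mechanisms to handle the imbalanced dataset. In addition, we validate the proposed method on the Lending Club dataset; the experimental results show that the proposed method can effectively improve the accuracy of default-risk prediction.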


    In the past few years, Peer-to-Peer lending (P2P lending) has grown rapidly around the world. The main idea of P2P lending is disintermediation, that is, removing intermediaries such as banks. For small businesses and individuals without enough credit or credit history, P2P lending is a good way to borrow money. However, the fundamental problem of this model is information asymmetry, which can prevent the default risk of a loan from being estimated correctly. Lenders decide whether to fund a loan based only on the information provided by borrowers, which makes P2P lending data an imbalanced dataset with unequal numbers of fully paid and defaulted loans. Imbalanced datasets are quite common in the real world, for example fraudulent credit card transactions or defective products in a factory. The imbalance can bias the machine learning schemes used to predict repayment behavior toward the majority class in order to achieve high accuracy. However, the characteristics of the minority class are much more meaningful in the lending business.
    In this thesis, we use several machine learning schemes to predict the default risk of P2P lending, and use re-sampling and cost-sensitive mechanisms to process imbalanced datasets. We also use the dataset from Lending Club to validate our proposed scheme. The experimental results show that our proposed scheme can effectively raise the prediction accuracy for default risk.
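
The abstract names re-sampling and cost-sensitive mechanisms but does not say which variants the thesis uses. As a rough illustration only, the sketch below shows two simple stand-ins: random over-sampling of the minority class, and inverse-frequency class weights (the same heuristic as scikit-learn's "balanced" class weights). The function names and the toy data (8 fully paid loans, 2 defaults) are hypothetical and are not taken from the thesis.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    are as large as the largest one (assumed stand-in for re-sampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).
    A classifier that accepts per-class weights would penalize errors on
    the rare class more heavily (assumed stand-in for cost-sensitive learning)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Hypothetical toy data: label 0 = fully paid, label 1 = default.
labels = [0] * 8 + [1] * 2
rows = [[i] for i in range(10)]

res_rows, res_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)
```

After over-sampling, both classes contain 8 rows; with class weights, the original data is left unchanged and the default class simply counts four times as much as the fully paid class (2.5 versus 0.625).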

    1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Chapter Overview
    2 Related Background Knowledge
      2.1 Peer-to-Peer Lending
      2.2 The Problem of Imbalanced Datasets in Classification
      2.3 Related Work
    3 Research Method and Processing Flow
      3.1 Research Method and Workflow
      3.2 Data Pre-processing
      3.3 Feature Selection
      3.4 Using Re-sampling and Cost-Sensitive Learning
        3.4.1 Re-sampling
        3.4.2 Cost-Sensitive Learning
      3.5 Machine Learning
        3.5.1 Logistic Regression
        3.5.2 Random Forest
        3.5.3 Neural Network
    4 Experimental Results and Evaluation
      4.1 Experimental Environment
      4.2 Dataset Introduction and Analysis
      4.3 Evaluation Method
      4.4 Experimental Results
    5 Conclusion
    References

