
Graduate student: 葉日揚 (Jih-Yang Yeh)
Thesis title: 一個結合極限梯度提升分類模型與關鍵字抽出方法的釣魚網站偵測服務架構
A Phishing Website Detection Service Mechanism Utilizing XGBoost Classification Model and Key-term Extraction Method
Advisor: 鄧惟中 (Wei-Chung Teng)
Oral defense committee: 林宗男 (Tsung-Nan Lin), 卓政宏 (Cheng-Hung Cho), 鄧惟中 (Wei-Chung Teng), 王勝德 (Sheng-De Wang), 沈上翔 (Shan-Hsiang Shen)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Pages: 39
Keywords (Chinese, translated): phishing website, XGBoost algorithm, key-term detection, feature pre-processing, sliding window
Keywords (English): phishing attack, XGBoost, keyterm detection, pre-processing, sliding window
    This research proposes a phishing website detection mechanism that combines an XGBoost-based phishing website classifier with a key-term extraction method. The classification model is built with XGBoost, a learning algorithm that has performed well in recent Kaggle competitions and is known for its efficiency and accuracy. Several pre-processing techniques are tried to improve performance, and the key-term based detection method is added to help reduce the false positive rate of the phishing website classification model.

    The key-term extraction method is based on two observations. First, phishers usually try to make phishing websites look similar to their imitation targets, so the sources associated with a phishing page are likely to contain clues, or key terms, that reveal the target; we compare multiple sources to find the name of the target or terms closely related to it. Second, legitimate websites are expected to rank high in search engines, so the ranking of a website in the search results for its key terms serves as a useful reference. The method mainly serves to identify the specific target that a phishing website imitates, if there is one, and to correct legitimate websites that were misclassified. In addition, the proposed mechanism uses a sliding window technique to reduce training cost, reaching the same performance with a smaller amount of training data.
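As a rough illustration of the key-term idea described above, the following Python sketch picks the term shared by the most page sources and then checks whether the page's own domain appears among the top search results for that term. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names, the choice of sources (URL, title, body text), the tokenizer, the `GENERIC` word list, the two-source voting threshold, the cut-off `top_n`, the exact-domain comparison, and the `search` callable, which stands in for a real search engine client.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Set
from urllib.parse import urlparse

# Hypothetical list of tokens too generic to identify an imitation target.
GENERIC = {"www", "com", "net", "org", "http", "https", "html", "login", "sign", "your"}


def tokens(text: str) -> Set[str]:
    """Lower-case alphabetic tokens of length >= 3, minus generic ones."""
    return {t for t in re.findall(r"[a-z]{3,}", text.lower()) if t not in GENERIC}


def extract_key_term(sources: Dict[str, str]) -> Optional[str]:
    """Return the token found in the most sources (assumed voting heuristic)."""
    counts: Counter = Counter()
    for text in sources.values():
        counts.update(tokens(text))  # each source votes at most once per token
    if not counts:
        return None
    # Sorting first makes ties resolve alphabetically, so the result is deterministic.
    term, votes = max(sorted(counts.items()), key=lambda kv: kv[1])
    return term if votes >= 2 else None  # require agreement between sources


def domain_of(url: str) -> str:
    """Host name of a URL with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def ranks_high(url: str, key_term: str,
               search: Callable[[str], List[str]], top_n: int = 10) -> bool:
    """True if the page's domain is among the top_n search results for key_term."""
    own = domain_of(url)
    return any(domain_of(result) == own for result in search(key_term)[:top_n])


def fake_search(term: str) -> List[str]:
    """Stub search engine returning fixed, made-up results."""
    return ["https://www.examplebank.example/",
            "https://news.example.org/examplebank-review"]


suspect_url = "http://examplebank-secure.example.net/login"
suspect_sources = {
    "url": suspect_url,
    "title": "ExampleBank: Log in",
    "body": "Sign in to your ExampleBank account",
}
key_term = extract_key_term(suspect_sources)
suspect_ok = ranks_high(suspect_url, key_term, fake_search)
genuine_ok = ranks_high("https://www.examplebank.example/signin", key_term, fake_search)
```

In this toy run the suspect page reveals its target through the shared term but does not appear in the stubbed results, while the genuine site does, which is the kind of signal that could be used to overturn a classifier verdict on a legitimate page.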

    The experiments use labeled data crawled from PhishTank and Alexa. Without the key-term detection method, the accuracy is about 98%. With the key-term method enabled, the number of misclassified legitimate websites is further reduced and the accuracy rises above 99%.
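The sliding window technique mentioned in the abstract can be pictured with the minimal Python sketch below. It assumes that samples are ordered by collection time, that each model is trained only on the most recent `window` samples, and that it is evaluated on the next `step` samples before the window advances. The function name, both parameters, and the ordering assumption are hypothetical; the abstract does not state the window size or how the window moves, and the classifier fit on each window is left out.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows(samples: Sequence[T], window: int,
                    step: int) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs over time-ordered samples.

    Each training set holds only `window` consecutive samples, and the
    test set holds the `step` samples that follow. Older data falls out
    of the window, which keeps every training set small.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while start + window < len(samples):
        train = list(samples[start:start + window])
        test = list(samples[start + window:start + window + step])
        yield train, test
        start += step  # slide forward by the size of the test slice


# Ten time-ordered placeholder samples, numbered by arrival order.
days = list(range(10))
pairs = list(sliding_windows(days, window=4, step=2))
```

With ten samples, a window of four, and a step of two, this produces three train/test pairs, each trained on four samples rather than on the full history.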

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 Research Background
      1.2 Motivation and Objectives
      1.3 Contributions
      1.4 Thesis Organization
    2 Related Work
      2.1 Phishing Website Detection Methods
        2.1.1 Blacklists
        2.1.2 Heuristics
        2.1.3 Visual Similarity of Web Pages
        2.1.4 Machine Learning
    3 Methodology
      3.1 System Workflow
      3.2 System Architecture Overview
      3.3 Features
        3.3.1 Feature Extraction Method
        3.3.2 Features Used
      3.4 Pre-processing Methods
        3.4.1 PCA (Principal Component Analysis)
        3.4.2 LDA (Linear Discriminant Analysis)
        3.4.3 Feature Combination Method
      3.5 Phishing Website Classification Model
      3.6 Key-term Extraction Method
        3.6.1 Key-term Extraction (keyterm extractor)
        3.6.2 Detection Procedure
    4 Experimental Results and Analysis
      4.1 Datasets
        4.1.1 PhishTank
        4.1.2 Alexa
        4.1.3 UCI
      4.2 Data Collection (Crawling)
      4.3 Evaluation Metrics
        4.3.1 Accuracy
        4.3.2 F-score
        4.3.3 Cumulative Error Rate
      4.4 Comparison of Pre-processing Methods and Classification Algorithms
        4.4.1 SVM
        4.4.2 XGBoost
      4.5 Evaluation of the Key-term Extraction Method
        4.5.1 Target Identification Test
        4.5.2 Performance Test
      4.6 Sliding Window Experiment
    5 Conclusion
    References

