Graduate student: | 葉日揚 Jih-Yang Yeh |
---|---|
Thesis title: | 一個結合極限梯度提升分類模型與關鍵字抽出方法的釣魚網站偵測服務架構 A Phishing Website Detection Service Mechanism Utilizing XGBoost Classification Model and Key-term Extraction Method |
Advisor: | 鄧惟中 Wei-Chung Teng |
Oral defense committee: | 林宗男 Tsung-Nan Lin, 卓政宏 Cheng-Hung Cho, 鄧惟中 Wei-Chung Teng, 王勝德 Sheng-De Wang, 沈上翔 Shan-Hsiang Shen |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2019 |
Academic year of graduation: | 107 |
Language: | Chinese |
Number of pages: | 39 |
Keywords (Chinese): | 釣魚網站, 極限梯度提升演算法, 關鍵字偵測, 特徵前處理, 滑動窗口 |
Keywords (English): | phishing attack, XGBoost, key-term detection, pre-processing, sliding window |
This research proposes a phishing website detection framework that combines a general phishing-page classifier with a key-term extraction method. The classifier is built with XGBoost, a learning algorithm known for its efficiency and accuracy that has performed well in recent Kaggle competitions, and several pre-processing methods are tried to improve its performance. Key-term extraction is added to reduce the misclassification rate of the classification model, in particular its false positives.
The key-term extraction method is based on two observations. First, phishers usually try to make a phishing page look like the site it imitates, so the page is very likely to leave clues in several places that reveal the imitation target; comparing multiple sources related to the website therefore yields the target's name or key terms closely tied to it. Second, legitimate websites are expected to rank high in search engines, so the rank of a site in the search results for its key terms serves as a reference for judging whether it is a phishing site. The method has two main effects: it identifies phishing pages that imitate a specific target, and it corrects legitimate pages that were misclassified. In addition, the framework uses a sliding window mechanism to reduce training cost, so that a model with the same performance can be trained on a smaller amount of training data.
For validation, experiments use labeled data collected from PhishTank and Alexa. Without the key-term method, the accuracy is about 98%. With it enabled, misclassified legitimate websites are corrected effectively and the accuracy rises to more than 99%.
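The key-term idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the stopword list, the rule that a term must appear in at least two sources, the `top_k` cutoff, and all function names (`extract_key_terms`, `refine_verdict`, `demo_search`) are assumptions made for illustration, and the search engine is replaced by a stub that returns a fixed ranking.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse

# Assumed stopword list; the abstract does not specify one.
STOPWORDS = {"www", "com", "net", "org", "http", "https", "html",
             "login", "the", "and", "for", "your"}


def tokenize(text: str) -> set:
    """Lower-case alphabetic tokens of length >= 3, minus stopwords."""
    return {t for t in re.findall(r"[a-z]{3,}", text.lower())
            if t not in STOPWORDS}


def extract_key_terms(sources: Dict[str, str], min_sources: int = 2) -> List[str]:
    """Multi-source comparison: keep terms shared by >= min_sources sources."""
    counts: Counter = Counter()
    for text in sources.values():
        counts.update(tokenize(text))  # each source votes once per term
    shared = [(term, n) for term, n in counts.items() if n >= min_sources]
    return [term for term, _ in sorted(shared, key=lambda x: (-x[1], x[0]))]


def domain_of(url: str) -> str:
    """Host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def refine_verdict(url: str,
                   flagged_by_classifier: bool,
                   sources: Dict[str, str],
                   search: Callable[[str], List[str]],
                   top_k: int = 5) -> Dict[str, object]:
    """Adjust a classifier verdict using the search rank of the key terms."""
    terms = extract_key_terms(sources)
    if not terms:
        # No shared terms: nothing to search for, keep the classifier verdict.
        return {"phishing": flagged_by_classifier, "key_terms": [],
                "suspected_target": None}
    ranked = [domain_of(r) for r in search(" ".join(terms[:3]))[:top_k]]
    if domain_of(url) in ranked:
        # The page's own domain ranks high for its key terms: treat as legitimate.
        return {"phishing": False, "key_terms": terms, "suspected_target": None}
    target: Optional[str] = ranked[0] if ranked else None
    return {"phishing": flagged_by_classifier, "key_terms": terms,
            "suspected_target": target}


def demo_search(query: str) -> List[str]:
    """Hypothetical stand-in for a real search engine query."""
    return ["https://www.paypal.com/", "https://en.wikipedia.org/wiki/PayPal"]
```

In this sketch a page flagged by the classifier is cleared only when its own domain appears among the top results for its key terms; otherwise the classifier verdict stands and the top-ranked domain is reported as the suspected imitation target. Whether the thesis applies the ranking in exactly this way is not stated in the abstract.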