
Graduate student: 林閔笙
Min-Sheng Lin
Thesis title: 利用網址字串資訊萃取之惡意網址過濾器
URL String Information Extraction for Malicious URL Filter
Advisor: 李育杰
Yuh-Jye Lee
Oral defense committee: 鮑興國
Hsing-Kuo Pao
陳昇瑋
Sheng-Wei Chen
黃俊穎
Chun-Ying Huang
Degree: Master
Department: College of Electrical Engineering and Computer Science
Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100
Language: English
Number of pages: 54
Chinese keywords: 惡意網站, 大規模資料集, 極端不平衡資料集
English keywords: Malicious Website, Large-scale Data Set, Extremely Unbalanced Data Set
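  • As web services have come into widespread use, web-based attacks such as phishing websites and drive-by downloads have become a very important issue. In the problem we face, more than one million URLs are sent to the analysis system for inspection, yet only a little over one hundred of them point to malicious websites. With this much data, fetching page content or requesting host information for every URL is not a sensible approach. To reduce the load on the analysis system, we propose a framework that uses only the information in the URL string to filter out suspicious URLs. From each string we extract lexical information and static attributes, which capture different properties of the URL string. To make full use of these two types of information, we choose two online learning algorithms and build a separate classification model with each. In the proposed framework the two models are used together, in effect inspecting each incoming URL string from different angles. Our experiments show that the filtering system can process more than one million URLs within five minutes and retain about 25% of them as suspicious, a subset that contains more than 90% of the malicious websites. This result shows that the proposed framework can handle data at this scale efficiently and accurately.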


    With the widespread adoption of web services, web-based attacks such as phishing and drive-by downloads have become regular threats. In practice, about one million URLs, of which only about one hundred are malicious, are submitted to the server for analysis every hour. Analyzing such an overwhelming number of URLs with content-based or host-based information is impractical. To reduce this overhead, we propose to filter malicious URLs using only the string information of the URLs, namely the lexical information and the static characteristics of the URL strings. The lexical information and the static characteristics capture different aspects of a URL string. To exploit both kinds of information, we build two corresponding filters, each trained with a different online learning algorithm. In our framework, the predictions of the two filters are fused at test time. In our experiments, the proposed filtering system handles one million URLs in 5 minutes and filters out 75% of them as benign. The remaining 25%, regarded as suspicious, cover around 90% of the malicious URLs. This result indicates that the proposed method is efficient and suitable for analyzing URLs at large scale.
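The two-filter idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. Several parts are assumptions made only for illustration: the tokenization rule, the particular static features (`length`, `dots`, `digit_ratio`, `ip_host`), the toy URLs, and the fusion rule (flag a URL if either filter scores it above a threshold). The abstract says the two filters use different online learning algorithms. For brevity this sketch trains both with the same one, the soft-margin Passive-Aggressive update (PA-I) of Crammer et al.

```python
import re
from collections import defaultdict


def lexical_features(url):
    """Sparse view: binary bag of tokens split on non-alphanumeric characters."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return {"tok=" + t: 1.0 for t in tokens if t}


def static_features(url):
    """Dense view: a few hypothetical string statistics, scaled to roughly [0, 1]."""
    host = re.sub(r"^[a-z]+://", "", url.lower()).split("/")[0]
    digits = sum(ch.isdigit() for ch in url)
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?", host) is not None
    return {
        "length": min(len(url), 200) / 200.0,
        "dots": min(url.count("."), 10) / 10.0,
        "digit_ratio": digits / max(len(url), 1),
        "ip_host": 1.0 if is_ip else 0.0,
        "bias": 1.0,
    }


class PassiveAggressive:
    """PA-I online binary classifier over dict features; labels are +1 or -1."""

    def __init__(self, C=1.0):
        self.C = C
        self.w = defaultdict(float)

    def score(self, x):
        return sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def update(self, x, y):
        """Process one example and return the hinge loss measured before the update."""
        loss = max(0.0, 1.0 - y * self.score(x))
        norm_sq = sum(v * v for v in x.values())
        if loss > 0.0 and norm_sq > 0.0:
            tau = min(self.C, loss / norm_sq)  # PA-I step size
            for k, v in x.items():
                self.w[k] += tau * y * v
        return loss


def is_suspicious(url, lexical_model, static_model, threshold=0.0):
    """Assumed fusion rule: flag the URL if either filter scores above threshold."""
    return (lexical_model.score(lexical_features(url)) > threshold
            or static_model.score(static_features(url)) > threshold)


# Toy, made-up training stream: +1 = malicious, -1 = benign.
training = [
    ("http://192.168.10.5/paypal/login.php", +1),
    ("http://secure-login.example.test/verify/account", +1),
    ("http://www.example.org/news/2012/index.html", -1),
    ("http://docs.example.org/guide/intro", -1),
]
lex, sta = PassiveAggressive(), PassiveAggressive()
for url, label in training:
    lex.update(lexical_features(url), label)
    sta.update(static_features(url), label)
```

In a pipeline of this kind, only URLs for which `is_suspicious` returns `True` would be passed on to the more expensive content-based or host-based analysis. Lowering `threshold` trades a larger suspicious set for a higher chance of retaining the rare malicious URLs.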

    1 Introduction
        1.1 Background
        1.2 Our Main Contribution
        1.3 Organization of Thesis
    2 Related Work
    3 Data Analysis
        3.1 Daily Malicious Rate
        3.2 The Rate of Using IP as Domain Name
        3.3 Top 10 Malicious Autonomous System Number
    4 Feature Extraction From URL String
        4.1 Sparse Features
        4.2 Dense Features
    5 Online Learning Algorithm
        5.1 Passive-Aggressive Algorithm
            5.1.1 Hard Margin
            5.1.2 Soft Margin
        5.2 Confidence Weighted Algorithm
    6 Our Proposed Framework
        6.1 Training Process
        6.2 Prediction Process
    7 Experiments
        7.1 Measure
        7.2 Evaluation
            7.2.1 The Growth of Dictionary Size
            7.2.2 Performance of Feature Set
            7.2.3 Prediction Performance
            7.2.4 Processing Time
            7.2.5 Performance of Dense and Sparse Features
    8 Conclusion and Future Works

