
Graduate student: Ke-wei Su (蘇克維)
Thesis title: Suspicious URL Filter based on Logistic Regression with Multi-view Analysis (可疑連結過濾器基於羅吉斯迴歸與多觀點分析)
Advisor: Hahn-Ming Lee (李漢銘)
Committee members: Feng-tza Lin (林豐澤)
Bo-Ren Jeng (鄭博仁)
Yu-jie Lee (李育杰)
Shing-guo Bau (鮑興國)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2012
Academic year of graduation: 100
Language: English
Pages: 45
Keywords (Chinese): 惡意連結
Keywords (English): malicious URL

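The Internet is flourishing and information systems are widely used, yet malicious URLs are rampant and incidents of stolen sensitive information are frequently reported, so detecting malicious URLs has become a basic safeguard of network security. Attackers' techniques change constantly (for example, inserting benign tokens into a URL) to raise the apparent trustworthiness of their URLs and thereby evade detection methods that rely on URL analysis. When the false-positive rate is too high, network staff must spend more time analyzing each website in detail and cannot warn users about malicious URLs in real time. This thesis therefore proposes filtering suspicious URLs in real time to reduce the number of URLs that require detailed analysis.
In this study, we propose a multi-view analysis to reduce the impact of obfuscation techniques. Every URL is built from many tokens, and each token carries a different meaning (for example, a region code or a path). Attackers apply obfuscation techniques that combine tokens in different portions of the URL, such as the region or the file type, to raise the URL's apparent credibility, and these techniques usually have their own behavior patterns. The proposed mechanism learns these behavior patterns in order to identify how suspicious each URL is: we split each URL into several portions and use logistic regression to learn the behavior patterns of each portion separately, so that one portion is not influenced by the others. We then use a greedy strategy to find suitable coefficients for each portion and set suspicion thresholds against which each URL's suspicion level is compared. Finally, the URLs filtered out as suspicious are passed to other systems, which reduces the number of URLs that need content analysis. Experimental results show that the proposed approach has a lower false-alarm rate and meets industry requirements.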
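The splitting step described in this abstract can be illustrated with a minimal sketch. The three portion names (authority, directory, argument) follow the section titles in the table of contents; the function name `split_views`, the tokenization rule (split on any non-alphanumeric character), and the sample URL are assumptions made for illustration, not the thesis's actual extractor.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split a string on non-alphanumeric characters and lowercase the tokens."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def split_views(url):
    """Return one token list per URL portion (hypothetical helper).

    authority -> host part, directory -> path, argument -> query string.
    """
    # urlsplit only fills netloc when a scheme separator is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    return {
        "authority": tokenize(parts.netloc),
        "directory": tokenize(parts.path),
        "argument": tokenize(parts.query),
    }

# Made-up URL in which a benign-looking token ("paypal") sits in the authority.
views = split_views("http://paypal.com.example.tw/login/update.php?id=42&lang=en")
for name, tokens in views.items():
    print(name, tokens)
```

Keeping the token lists separate per portion is what lets each portion be modelled on its own later, so a benign token inserted in one portion does not dilute the evidence from another.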


Current malicious-URL detection techniques based on URL analysis have difficulty finding malicious URLs that are disguised by obfuscation techniques (e.g., insertion of benign tokens). In this study, we propose a multi-view approach to reduce the impact of such obfuscation. A URL is composed of several tokens, and each token has a different meaning. Attackers apply different obfuscation techniques, combining tokens in different portions of the URL, and each technique has its own behavior. The proposed mechanism learns these behaviors from the different portions of URLs (e.g., the authority portion) in order to identify the level of suspicion of each portion. By comparing the suspicion level of each portion across URLs, the system selects the most suspicious URLs. This thesis makes the following contributions:
(1) providing a multi-view mechanism that reduces the effect of obfuscation techniques, (2) automatically filtering out suspicious URLs without additional configuration or modification, (3) dealing effectively with large-scale and unbalanced data, and (4) satisfying the requirements of industry.
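As a rough illustration of the multi-view idea, the sketch below trains one small logistic regression model per URL portion on bag-of-words tokens. Everything in it is an assumption made for illustration: the toy training data is invented, the names `train_logreg`, `score`, and `view_scores` are hypothetical, and plain stochastic gradient descent stands in for whatever solver the thesis actually uses.

```python
import math
from collections import defaultdict

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(samples, labels, epochs=200, lr=0.5):
    """Fit binary logistic regression on token presence with plain SGD."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for tokens, y in zip(samples, labels):
            present = set(tokens)
            p = sigmoid(bias + sum(weights[t] for t in present))
            grad = p - y  # gradient of the log loss with respect to z
            bias -= lr * grad
            for t in present:
                weights[t] -= lr * grad
    return weights, bias

def score(model, tokens):
    """Suspicion level in [0, 1] for one portion of one URL."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in set(tokens)))

VIEWS = ("authority", "directory", "argument")

# Invented toy data: (token lists per portion, label), 1 = malicious.
train = [
    ({"authority": ["paypal", "com", "evil", "ru"], "directory": ["login", "update"], "argument": ["id"]}, 1),
    ({"authority": ["free", "gift", "ru"], "directory": ["paypal", "com", "signin"], "argument": ["token"]}, 1),
    ({"authority": ["paypal", "com"], "directory": ["help"], "argument": []}, 0),
    ({"authority": ["example", "org"], "directory": ["news", "2012"], "argument": ["page"]}, 0),
]
labels = [y for _, y in train]

# One independent model per portion, so portions cannot influence each other.
models = {v: train_logreg([x[v] for x, _ in train], labels) for v in VIEWS}

def view_scores(url_views):
    return {v: score(models[v], url_views[v]) for v in VIEWS}

print(view_scores(train[1][0]))  # benign-looking tokens placed in the directory
print(view_scores(train[2][0]))  # the same tokens in the authority of a benign URL
```

Because each portion has its own model, the tokens `paypal` and `com` can be weighted differently depending on where they appear, which is the property a single whole-URL model lacks.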
For the system evaluation, this thesis uses a real data set from T. Co. The requirements of T. Co. are: (1) the detection rate should be less than 25%, (2) the missing rate should be lower than 25%, and (3) one hour of data should be processed within one hour. The experimental results show that our approach is effective, finds more malicious URLs, and satisfies the requirements of the practical environment as well as those of T. Co.
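The two rate requirements can be made concrete with a small sketch. This record does not define the metrics, so the code assumes that "detection rate" means the share of all URLs passed on as suspicious and that "missing rate" means the share of malicious URLs left unflagged. The greedy threshold search is likewise only one plausible reading of the "greedy strategy" named in the outline, not the thesis's algorithm; all function names and the toy scores are hypothetical.

```python
def rates(flags, labels):
    """Return (flag_rate, missing_rate) under the assumed definitions."""
    total = len(labels)
    malicious = sum(labels)
    flagged = sum(1 for f in flags if f)
    missed = sum(1 for f, y in zip(flags, labels) if y == 1 and not f)
    flag_rate = flagged / total
    missing_rate = missed / malicious if malicious else 0.0
    return flag_rate, missing_rate

def flag(scores, thresholds):
    """A URL is suspicious if any portion's score reaches its threshold."""
    return [any(s[v] >= thresholds[v] for v in thresholds) for s in scores]

def greedy_thresholds(scores, labels, max_missing=0.25, step=0.05):
    """Lower one portion's threshold at a time until few enough are missed."""
    views = list(scores[0])
    thresholds = {v: round(1.0 + step, 2) for v in views}  # flags nothing at first
    while rates(flag(scores, thresholds), labels)[1] >= max_missing:
        best = None
        for v in views:
            lowered = round(thresholds[v] - step, 2)
            if lowered < 0:
                continue
            trial = dict(thresholds)
            trial[v] = lowered
            flag_rate, missing_rate = rates(flag(scores, trial), labels)
            key = (missing_rate, flag_rate)  # prefer fewer misses, then fewer flags
            if best is None or key < best[0]:
                best = (key, trial)
        if best is None:
            break
        thresholds = best[1]
    return thresholds

# Invented per-portion suspicion scores: four malicious URLs, then four benign.
scores = [
    {"authority": 0.9, "directory": 0.2, "argument": 0.1},
    {"authority": 0.3, "directory": 0.8, "argument": 0.2},
    {"authority": 0.1, "directory": 0.1, "argument": 0.7},
    {"authority": 0.2, "directory": 0.3, "argument": 0.1},
    {"authority": 0.2, "directory": 0.1, "argument": 0.1},
    {"authority": 0.1, "directory": 0.2, "argument": 0.2},
    {"authority": 0.4, "directory": 0.1, "argument": 0.1},
    {"authority": 0.1, "directory": 0.1, "argument": 0.3},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

thresholds = greedy_thresholds(scores, labels)
flag_rate, missing_rate = rates(flag(scores, thresholds), labels)
print(thresholds, flag_rate, missing_rate)
```

On real data the resulting flag rate would then be checked against the 25% ceiling; the sketch only guarantees the missing-rate target, and it does not address the one-hour processing requirement.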

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Goals
  1.4 Thesis Contributions
  1.5 Outlines of the Thesis
2 Background and Related Work
  2.1 Background of Malicious URL
  2.2 Related Work of Detecting Malicious URLs
    2.2.1 Black List
    2.2.2 Classification of URLs
    2.2.3 Classification in Related Contexts
    2.2.4 Non-Machine Learning Approaches
3 Suspicious URL Filtering
  3.1 URL Token Composition Extractor
    3.1.1 URL Syntax Components Extractor
    3.1.2 Lexical Feature Parser based on BOW
  3.2 Multi-View Malicious URL Behavior Learner based on Logistic Regression
  3.3 Filtering Threshold Calculator with Greedy Strategy (training phase)
    3.3.1 Suspicious Authority, Directory and Argument Estimator
    3.3.2 Thresholds Finder with Greedy Strategy
  3.4 Suspicious URL Finder (testing phase)
    3.4.1 Suspicious Authority, Directory and Argument Estimator
    3.4.2 Suspicious URL Filter
4 Experiment Results
  4.1 Experiment Design
  4.2 Dataset Description
  4.3 Evaluation Metrics and Experiment Setup
  4.4 Experimental Result and Discussion
5 Conclusions and Further Work
  5.1 Discussion
  5.2 Conclusion
  5.3 Further Work

