
Graduate student: 曹祐 (You Cao)
Thesis title: Malicious JavaScript detection using similarity measurement
Advisor: 鄧惟中 (Wei-Chung Teng)
Oral defense committee: 陳俊良 (Jiann-Liang Chen), 項天瑞 (Tien-Ruey Hsiang)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2020
Academic year of graduation: 108
Language: Chinese
Number of pages: 31
Keywords: Malicious code detection with low domain knowledge, Similarity measurement, Machine learning, JavaScript
  • JavaScript is now used almost everywhere, so detecting malicious JavaScript quickly
    is an important problem. Methods for deciding whether a program behaves maliciously
    fall into two broad categories: static and dynamic. Both usually demand a large amount
    of domain knowledge, which places a heavy burden on security personnel. This research
    aims to find a method that detects malicious JavaScript with a low domain-knowledge
    requirement. Because static approaches usually cost less and scale better than dynamic
    ones, we start from a static approach.
    Recent studies have shown that classifying malware by similarity is effective
    [1] [2] [3]. The advantage of this static approach is that it needs little domain
    knowledge: it judges whether a program is malicious simply by comparing program
    bytecode. Edward Raff et al. improved the Normalized Compression Distance (NCD) [4],
    a general-purpose similarity measure proposed by Ming Li et al., into the Lempel-Ziv
    Jaccard Distance (LZJD) [5], which is better suited to comparing large sequences.
    Building on LZJD, they proposed the malware classification algorithm SHWeL, introduced
    in "Malware Classification and Class Imbalance via Stochastic Hashed LZJD" [1]. SHWeL
    classifies malware, for example deciding whether a sample is a Windows executable or an
    Android application, and it lets users do so by comparing bytecode similarity without
    first acquiring knowledge about the samples. Because this matches our research
    objective, we use SHWeL as the base of our malicious JavaScript detection system and
    improve it for that task. The improved system loses the classification capability but
    gains higher detection accuracy.
    Experiments show that the proposed architecture and algorithm achieve a false negative
    rate of 7.21%, about 9% lower than the method before improvement. This also indicates
    that similarity comparison works well for malicious JavaScript detection.
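
The two similarity measures named in the abstract can be sketched as follows. This is a minimal illustration of plain NCD [4] and plain LZJD [5] as described in those references, not the improved SHWeL pipeline of this thesis. The function names (`ncd`, `lz_set`, `lzjd`), the use of `zlib` as the compressor, and the sample inputs are all assumptions made for illustration. The published LZJD additionally hashes the substrings and keeps only a fixed number of the smallest hashes for speed; that step is omitted here.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, approximated with zlib (an assumed compressor)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_set(data: bytes) -> set:
    """Collect the distinct substrings produced by a Lempel-Ziv style parse."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New substring: record it and begin the next one after it.
            seen.add(chunk)
            start = end
    return seen


def lzjd(x: bytes, y: bytes) -> float:
    """Jaccard distance between the Lempel-Ziv substring sets of x and y."""
    a, b = lz_set(x), lz_set(y)
    union = a | b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical inputs: raw script bytes stand in for the bytecode
    # sequences that the abstract says are compared.
    s1 = b"var x = document.getElementById('a'); x.value = 1;"
    s2 = b"var y = document.getElementById('b'); y.value = 2;"
    s3 = bytes(range(256))
    print("LZJD similar  :", lzjd(s1, s2))
    print("LZJD unrelated:", lzjd(s1, s3))
    print("NCD similar   :", ncd(s1, s2))
```

Both functions return a distance, so smaller values mean more similar inputs. Vectors of such distances to reference samples are the kind of feature a downstream classifier could consume.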

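    1 Introduction
      1.1 Research Background
      1.2 Motivation and Objectives
      1.3 Contributions
      1.4 Thesis Organization
    2 Related Work and Problem Discussion
      2.1 Distance Metric
      2.2 Normalized Compression Distance
      2.3 Lempel-Ziv Jaccard Distance
      2.4 Stochastic Hashed Weighted Lempel-Ziv
    3 Methodology
      3.1 Improvements
      3.2 System Workflow
      3.3 Vector Construction
        3.3.1 Overview
        3.3.2 Pseudocode and Example
        3.3.3 Design Considerations
    4 Experimental Results and Analysis
      4.1 Dataset and Experimental Environment
        4.1.1 Dataset Sources
        4.1.2 Data Collection and Dataset Size
        4.1.3 Dataset Release
        4.1.4 Experimental Environment
      4.2 Evaluation Metrics
        4.2.1 False Positive Rate (FPR) and False Negative Rate (FNR)
        4.2.2 Accuracy
        4.2.3 Precision, Recall, and F1-Measure
      4.3 Comparison of Detection Rates across Algorithms and Machine Learning Models
      4.4 Algorithm Parameter Experiments
    5 Conclusion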

    [1] E. Raff and C. Nicholas, “Malware Classification and Class Imbalance via Stochastic Hashed LZJD,” Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 111–120, 2017.
    [2] N. Alshahwan, E. T. Barr, D. Clark, and G. Danezis, “Detecting Malware with Information Complexity,” 2015.
    [3] M. Hayes, A. Walenstein, and A. Lakhotia, “Evaluation of malware phylogeny modelling systems using automated variant generation,” Journal of Computer Virology and Hacking Techniques, 2008.
    [4] M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitányi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
    [5] E. Raff and C. Nicholas, “An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance,” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1007–1015, 2017.
    [6] Microsoft, “Microsoft Security Intelligence Report Volume 23.” https://info.microsoft.com/rs/157-GQE-382/images/EN-US_CNTNT-eBook-SIR-volume-23_March2018.pdf, 2018.
    [7] “Cyber Threat Alliance Releases Analysis of Illicit Cryptocurrency Mining.” https://securingtomorrow.mcafee.com/blogs/other-blogs/mcafee-labs/cyber-threat-alliance-releases-analysis-of-illicit-cryptocurrency-mining/. (Accessed on 02/11/2019).
    [8] R. S. Borbely, “On normalized compression distance and large malware: Towards a useful definition of normalized compression distance for the classification of large files,” Journal of Computer Virology and Hacking Techniques, vol. 12, pp. 235–242, 2016.
    [9] Microsoft, “Microsoft Security Intelligence Report Volume 21.” https://www.microsoft.com/security/blog/2016/12/14/microsoft-security-intelligence-report-volume-21-is-now-available/, 2016.
    [10] “McAfee Labs Threat Report September 2016.” https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-sep-2016.pdf. (Accessed on 02/11/2019).
    [11] Microsoft, “Microsoft Security Intelligence Report Volume 24.” https://info.microsoft.com/ww-landing-M365-SIR-v24-Report-eBook.html, 2019.
    [12] M. Akiyama, M. Iwamura, Y. Kawakoya, K. Aoki, and M. Itoh, “Design and Implementation of High Interaction Client Honeypot for Drive-by-Download Attacks,” Institute of Electronics, Information and Communication Engineers, 2010.
    [13] J. Wang, Y. Xue, Y. Liu, and T. H. Tan, “JSDC: A Hybrid Approach for JavaScript Malware Detection and Classification,” Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pp. 109–120, 2015.
    [14] “Rhino Mozilla.” https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino. (Accessed on 02/11/2019).
    [15] “Java bytecode instruction listings.” https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings. (Accessed on 02/11/2019).
    [16] “Alexa.” https://www.alexa.com/topsites.
    [17] “Scumware.” https://www.scumware.org/.
    [18] “Cleanmx.” https://support.clean-mx.com/clean-mx/viruses.php.
    [19] “GitHub: HynekPetrak/javascript-malware-collection.” https://github.com/HynekPetrak/javascript-malware-collection.
    [20] “GitHub: geeksonsecurity/js-malicious-dataset.” https://github.com/geeksonsecurity/js-malicious-dataset.
    [21] “Virustotal.” https://www.virustotal.com.
    [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, 2011.
