Author: 楊鈞旭
Jun-Xu Yang
Thesis Title: 一個改良位元組碼相似度計算之JavaScript惡意程式偵測方法
An improved bytecode similarity measurement for malicious JavaScript code detection
Advisor: 鄧惟中
Wei-Chung Teng
Committee: 林宗男
Tsung-nan Lin
陳俊良
Jiann-Liang Chen
沈上翔
Shan-Hsiang Shen
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2019
Graduation Academic Year: 107
Language: 中文 (Chinese)
Pages: 31
Keywords (in Chinese): JavaScript, 惡意程式碼偵測, 位元組碼, 相似度測量
Keywords (in other languages): JavaScript, malicious code detection, bytecode, similarity measurement
  • 偵測惡意 JavaScript 程式碼的方法主要可以分成兩種: 動態方法以及靜態方法。動態方法通常會使用高互動性 honey clients 或是低互動性的 honey clients 用來偵測惡意行為,靜態方法則主要透過機器學習方法來得知惡意程式碼中的特性,並且最後可以根據這些特性來判斷是否為惡意程式碼。由於惡意軟體的性質會隨著時間而變化,並且惡意軟體也時常會故意破壞有關格式規範的規則或是嘗試使用未定義的行為,因此透過領域知識來針對惡意軟體性質提取特徵也必須因應惡意軟體的變化去做更新,這會導致使用需要用到領域知識的提取特徵會需要額外的開銷,所以能夠讓領域知識 (Domain knowledge) 的使用最小化並使用在提取特徵是相當重要的,所以本研究的主要前提為測量不同對象間位元組碼相似度去偵測惡意程式碼,因為這可以使用到較少的領域知識 (Domain knowledge)。
    Ming Li 等人 [1] 曾提出了Normalized Compression Distance (NCD),一個可以測量任意兩個對象相似度的有效度量,並且已經有許多研究[2][3][4][5]透過使用NCD 去比較檔案的原始位元組碼或是調用一些 API 產生的內容來偵測惡意軟體,目前較新研究,Edward Raff 等人 [6] [7] 提出Lempel-Ziv Jaccard Distance(LZJD),一個在較大型序列中表現會比 NCD 還要好的度量。本研究主要以Edward Raff 等人 [6] [7] 提出的 LZJD 為基準,來更加的提升惡意程式碼偵測率。
    實驗結果顯示,本篇論文提出的架構及演算法,相對於先前研究,對於位元組碼 (Bytecode) 的處理較適當以及擁有較低的程式碼誤判率,假陽性率 (false positive rate) 達到 0.43%、假陰性率(false negative rate) 達到 6.89%。


    Approaches to detecting malicious JavaScript code fall into two categories: dynamic and static. Dynamic approaches are mostly based on low-interaction or high-interaction honey clients. Static approaches mainly use machine learning techniques to capture the characteristics of malicious scripts and then classify code according to those characteristics. Malware classification is subject to concept drift, meaning that the nature of malware changes over time. Because malware often intentionally breaks format-specification rules or attempts undefined behavior, features extracted using domain knowledge of malware properties must be updated as malware changes, which adds overhead to feature extraction. Minimizing the domain knowledge needed for feature extraction is therefore important. The main premise of this research is to detect malicious code by measuring the bytecode similarity between different objects, because this approach requires less domain knowledge.
    In previous research, Ming Li et al. [1] proposed the Normalized Compression Distance (NCD), a valid measure of the similarity of any two objects. Many studies [2] [3] [4] [5] have used NCD to detect malware by comparing raw byte contents or API call sequences. More recently, Edward Raff et al. [6] [7] proposed the Lempel-Ziv Jaccard Distance (LZJD), a measure that performs better than NCD on larger sequences. This research takes the LZJD proposed by Edward Raff et al. [6] [7] as its baseline and aims to further improve the detection rate of malicious code.
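The two baseline measures named above can be illustrated with a minimal Python sketch. It is not the improved method proposed in the thesis. It assumes zlib as the compressor for NCD and uses the exact (unhashed) Jaccard form of LZJD, whereas the published LZJD approximates this with hashing; the function names `ncd`, `lz_set`, and `lzjd` are hypothetical.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance; zlib is an assumed stand-in compressor."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_set(data: bytes) -> set:
    """Distinct substrings produced by a Lempel-Ziv style left-to-right parse."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New substring: record it and start the next one after it.
            seen.add(chunk)
            start = end
    return seen


def lzjd(x: bytes, y: bytes) -> float:
    """Exact Jaccard distance between the two LZ sets (no min-hash approximation)."""
    a, b = lz_set(x), lz_set(y)
    union = a | b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(a & b) / len(union)
```

Both functions return a value near 0 for similar inputs and near 1 for dissimilar ones. With a real compressor, `ncd(x, x)` is small but generally not exactly 0, while `lzjd(x, x)` is exactly 0.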
    The experiments show that the architecture and algorithm proposed in this research achieve a low false positive rate (0.43%) and a low false negative rate (6.89%) compared with previous research. This also indicates that the proposed preprocessing of the bytecode is more suitable than that used in previous work.
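For reference, the two error rates reported above follow the standard confusion-matrix definitions. The sketch below assumes that malicious samples are the positive class; the function name `error_rates` and the counts in the usage note are hypothetical and are not the thesis data.

```python
from typing import Tuple


def error_rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate).

    Assumes malicious = positive and benign = negative.
    Raises ZeroDivisionError if a class has no samples.
    """
    fpr = fp / (fp + tn)  # benign samples wrongly flagged as malicious
    fnr = fn / (fn + tp)  # malicious samples that were missed
    return fpr, fnr
```

For example, with made-up counts, `error_rates(tp=90, fp=5, tn=995, fn=10)` gives an FPR of 5/1000 and an FNR of 10/100.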

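    1 Introduction
    1.1 Research Background
    1.2 Motivation and Objectives
    1.3 Contributions
    1.4 Thesis Organization
    2 Related Work and Problem Discussion
    2.1 Distance Metric
    2.2 Normalized Compression Distance
    2.3 Lempel-Ziv Jaccard Distance
    2.4 Stochastic Hashed Weighted Lempel-Ziv
    2.5 Problem Discussion
    3 Methodology
    3.1 Improvements Proposed in This Research
    3.2 System Flow
    3.3 Sequence Conversion
    3.3.1 Pseudocode and Example
    3.3.2 Design Considerations
    3.4 Subsequence Statistics
    3.4.1 Pseudocode and Example
    3.4.2 Design Considerations
    4 Experimental Results and Analysis
    4.1 Datasets
    4.1.1 Dataset Sources
    4.1.2 Data Collection (Crawling) and Dataset Size
    4.2 Evaluation Metrics
    4.2.1 False Positive Rate (FPR) and False Negative Rate (FNR)
    4.2.2 Accuracy
    4.2.3 F1-Measure
    4.3 Comparison of Detection Rates across Algorithms and Machine Learning Algorithms
    4.4 Experiments on Algorithm Parameters and Detection Rate
    4.4.1 Max Window Size
    4.4.2 Feature Dimension
    4.5 Comparison of Detection Time and Efficiency
    5 Conclusion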

    [1] M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitányi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
    [2] N. Alshahwan, E. T. Barr, D. Clark, and G. Danezis, “Detecting malware with information complexity,” CoRR, vol. abs/1502.07661, 2015.
    [3] J. Andersen, F. Jahanian, Z. M. Mao, J. Nazario, J. Oberheide, and M. Bailey, “Automated Classification and Analysis of Internet Malware,” Recent Advances in Intrusion Detection, pp. 178–197, 2007.
    [4] M. Hayes, A. Walenstein, and A. Lakhotia, “Evaluation of malware phylogeny modelling systems using automated variant generation,” Journal in Computer Virology, vol. 5, p. 335, Jul 2008.
    [5] S. Wehner, “Analyzing worms and network traffic using compression,” Journal of Computer Security, vol. 15, no. 3, pp. 303–320, 2007.
    [6] E. Raff and C. Nicholas, “An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance,” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17, pp. 1007–1015, 2017.
    [7] E. Raff and C. Nicholas, “Malware Classification and Class Imbalance via Stochastic Hashed LZJD,” Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security - AISec ’17, pp. 111–120, 2017.
    [8] Y. M. Wang, D. Beck, X. Jiang, and R. Roussev, “Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities,” Tech. Rep. MSR-TR-2005-72, August 2005.
    [9] M. Roesch, “Snort - lightweight intrusion detection for networks,” Proceedings of the 13th USENIX Conference on System Administration, pp. 229–238, 1999.
    [10] A. Ikinci, T. Holz, and F. C. Freiling, “Monkey-Spider: Detecting Malicious Websites with Low-Interaction Honeyclients,” Igarss 2014, no. 1, pp. 1–5, 2014.
    [11] R. S. Borbely, “On normalized compression distance and large malware: Towards a useful definition of normalized compression distance for the classification of large files,” Journal of Computer Virology and Hacking Techniques, vol. 12, no. 4, pp. 235–242, 2016.
    [12] “Rhino - mozilla | mdn.” https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino.
    [13] “Alexa.” https://www.alexa.com/topsites.
    [14] “Scumware.” https://www.scumware.org/.
    [15] “Clean-mx.” https://support.clean-mx.com/clean-mx/viruses.php.
    [16] “Github:hynekpetrak/javascript-malware-collection.” https://github.com/HynekPetrak/javascript-malware-collection.
    [17] “Github:geeksonsecurity/js-malicious-dataset.” https://github.com/geeksonsecurity/js-malicious-dataset.
    [18] “Virustotal.” https://www.virustotal.com/gui/home/upload.
