
Author: 林家齊
Chia-Chi Lin
Thesis Title: 機器學習應用於惡意PDF文件檢測與特徵組合分析
Machine Learning Approaches to Malicious PDF Document Detection and Feature Combination Analysis
Advisor: 陳俊良
Jiann-Liang Chen
Oral Defense Committee: 黃能富
Nen-Fu Huang
呂政修
Jenq-Shiou Leu
洪論評
Lun-Ping Hung
鄧德雋
Der-Jiunn Deng
陳俊良
Jiann-Liang Chen
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2023
Academic Year of Graduation: 111
Language: English
Number of Pages: 65
Keywords (Chinese): 惡意PDF, 機器學習, CatBoost, 增強特徵, 特徵分析
Keywords (English): Malicious PDF, Machine Learning, CatBoost, Enhancement Features, Features Analysis
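  • With the rapid development of technology, the digital era has brought many benefits to people's lives, but it has also given cyber attackers more opportunities to attack. According to research by Palo Alto Networks, the use of phishing PDF files is increasing sharply: attackers use social engineering and phishing techniques to lure victims into clicking on or downloading malicious files in order to steal data or take control of devices for profit.
    Portable Document Format (PDF) is a file format developed by Adobe that preserves the original text and image formatting and supports cross-platform use. Features such as multimedia, interactive elements, signatures, and encryption make PDF a powerful tool that is highly trusted by a broad user base. However, attackers often exploit PDF functionality and vulnerabilities in PDF readers to hide malware or scripts inside PDF files and thereby evade antivirus detection.
    To defend against attacks based on malicious PDF files, this study inspects PDF files by static analysis and proposes a detection system based on the physical and content characteristics of PDF files, comprising feature construction, feature evaluation, and a machine learning mechanism. The dataset comes from Evasive-PDFMal2022, the public dataset of the Canadian Institute for Cybersecurity, combined with benign and malicious PDF files collected from GitHub and VirusTotal, and is split into training, validation, and test sets.
    To detect malicious PDF files, this study extracts 33 features based on the physical characteristics of PDF files and the logical characteristics of their content structure, and evaluates the importance of each feature with a CatBoost model. The logic-based features are further divided into two types, declare-based features and functional-based features, and include five enhanced features introduced to strengthen the detection of malicious PDF files. The results show that the five enhanced features effectively help the system detect malicious files, with accuracy reaching 99.35%, confirming that the proposed malicious PDF detection system outperforms previous studies.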


    With the rapid advancement of technology, the digital era has brought many benefits to people's lives, but it has also opened more opportunities for cyber attackers. According to research by Palo Alto Networks, phishing PDFs are rising sharply, with attackers using social engineering and phishing techniques to trick victims into clicking on or downloading malicious files in order to steal data or take control of devices for profit.
    Portable Document Format (PDF) is a file format developed by Adobe that preserves original text and image formatting and supports cross-platform use. PDF offers multimedia, interactivity, signature, and encryption capabilities, making it a powerful tool that is highly trusted by users. However, attackers often exploit PDF functionality and vulnerabilities in PDF readers to hide malware or scripts in PDF files and avoid detection by anti-virus software.
    This study proposes a static-analysis system for detecting malicious PDF files based on their physical and content characteristics, comprising feature creation, feature evaluation, and machine learning mechanisms. The dataset combines the public Evasive-PDFMal2022 dataset from the Canadian Institute for Cybersecurity (CIC) with benign and malicious PDF files collected from GitHub and VirusTotal. It is split into a training set, a validation set, and a test set.
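
    The three-way split described above can be sketched as follows. This is a minimal illustration only: the abstract does not state the split ratios or whether the split was stratified, so the 70/15/15 percentages, the stratification by class, and the function name `stratified_split` are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, percents=(70, 15, 15), seed=0):
    """Return (train, val, test) index lists with similar class balance.

    The 70/15/15 default is an assumption, not the ratio used in the thesis.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy labels: 0 = benign, 1 = malicious (illustrative counts only).
labels = [0] * 100 + [1] * 60
train_idx, val_idx, test_idx = stratified_split(labels)
```

    Splitting per class keeps the benign-to-malicious ratio roughly the same in all three sets, which matters when accuracy is the reported metric.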
    This study extracted 33 features based on the physical properties of PDF documents and the logical properties of their content structure, and evaluated the importance of each feature with a CatBoost model. The logic-based features are divided into declare-based and functional-based features, and five enhanced features were introduced to improve the detection of malicious PDF documents. The results showed that the five enhanced features helped the system detect malicious PDF documents with an accuracy of 99.35%, indicating that the proposed detection system outperforms previous studies.
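
    As a rough illustration of static feature extraction from a PDF file, the sketch below counts a few structural markers and PDF name keywords directly in the raw bytes. It does not reproduce the thesis's 33 features: the feature names, the keyword list (`SUSPICIOUS_NAMES`), and the function `extract_static_features` are hypothetical choices for this example.

```python
import re

# PDF name keywords often associated with active content.
# This list is an assumption for illustration, not the thesis's feature set.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile")


def extract_static_features(data: bytes) -> dict:
    """Compute simple static features from raw PDF bytes without opening the file in a reader."""
    features = {
        # File-level properties.
        "file_size": len(data),
        "has_pdf_header": int(data.startswith(b"%PDF-")),
        "num_eof_markers": data.count(b"%%EOF"),
        # Structural counts; \bobj\b does not match inside "endobj".
        "num_objects": len(re.findall(rb"\bobj\b", data)),
        "num_streams": len(re.findall(rb"\bendstream\b", data)),
    }
    for name in SUSPICIOUS_NAMES:
        key = "count_" + name[1:].decode("ascii").lower()
        # The lookahead stops /JS from matching a longer name with the same prefix.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        features[key] = len(re.findall(pattern, data))
    return features


# A tiny hand-written PDF fragment (not a complete, valid document).
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert\\(1\\);) >>\nendobj\n"
    b"%%EOF\n"
)
features = extract_static_features(sample)
```

    Raw byte matching of this kind misses names that are obfuscated with `#xx` hex escapes or stored inside compressed object streams, so a practical extractor would need to parse and decode the file first.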

    Abstract (in Chinese)
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contributions
        1.3 Organization
    Chapter 2 Related Work
        2.1 Portable Document Format Structure
            2.1.1 Physical Structure
            2.1.2 Logical Structure
        2.2 Malicious PDF Document Attack Trends
        2.3 Evasion Attacks in Malicious PDF Documents Detection
        2.4 Detection of Malicious PDF Documents
    Chapter 3 Proposed System
        3.1 System Architecture
        3.2 Data Collection
            3.2.1 Data Source
        3.3 Feature Definition
            3.3.1 Physical-based Features
            3.3.2 Declare-based Features
            3.3.3 Functional-based Features
        3.4 Data Processing
        3.5 Detection Model Architecture
    Chapter 4 Performance Analysis
        4.1 System Environment
            4.1.1 Experiment Environment
            4.1.2 Experiment Parameter
        4.2 Performance Evaluation Metrics
        4.3 Performance Analysis
            4.3.1 Feature Analysis of Physical-based Features
            4.3.2 Feature Analysis of Declare-based Features
            4.3.3 Feature Analysis of Functional-based Features
            4.3.4 Feature Analysis of All Features
        4.4 Comparison of Different Study
        4.5 Summary
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Future Works
    References

