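Machine Learning Approaches to Malicious PDF Document Detection and Feature Combination Analysis | National Taiwan University of Science and Technology Electronic Theses and Dissertations System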


Graduate student:	林家齊 Chia-Chi Lin
Thesis title:	機器學習應用於惡意PDF文件檢測與特徵組合分析 (Machine Learning Approaches to Malicious PDF Document Detection and Feature Combination Analysis)
Advisor:	陳俊良 Jiann-Liang Chen
Oral defense committee:	黃能富 Nen-Fu Huang, 呂政修 Jenq-Shiou Leu, 洪論評 Lun-Ping Hung, 鄧德雋 Der-Jiunn Deng, 陳俊良 Jiann-Liang Chen
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication:	2023
Graduation academic year:	111 (ROC calendar, i.e. 2022-2023)
Language:	English
Number of pages:	65
Keywords:	Malicious PDF, Machine Learning, CatBoost, Enhancement Features, Features Analysis

With the rapid advancement of technology, the digital age has brought many benefits to people's lives, but it has also given cyber attackers more opportunities. According to research by Palo Alto Networks, the use of phishing PDF files has risen sharply: attackers use social engineering and phishing techniques to trick victims into clicking on or downloading malicious files in order to steal data or take control of devices for profit.
Portable Document Format (PDF) is a file format developed by Adobe that preserves the original text and image formatting and supports cross-platform use. PDF offers multimedia, interactive elements, signatures, and encryption, which makes it a powerful tool that users trust highly. However, attackers often exploit PDF functionality and vulnerabilities in PDF readers to hide malware or scripts in PDF files and evade detection by antivirus software.
To defend against malicious PDF attacks, this study examines PDF files through static analysis and proposes a detection system based on the physical and content characteristics of PDF files, covering feature creation, feature evaluation, and machine learning mechanisms. The dataset comes from the public Evasive-PDFMal2022 dataset of the Canadian Institute for Cybersecurity (CIC), combined with benign and malicious PDF files collected from GitHub and VirusTotal, and is split into training, validation, and test sets.
The study extracts 33 features based on the physical properties of PDF files and the logical properties of their content structure, and evaluates the importance of each feature with the CatBoost model. The logical features are further divided into declare-based and functional-based features. Five enhanced features are also introduced to strengthen the detection of malicious PDF documents. The results show that these five enhanced features help the system detect malicious PDF documents, reaching an accuracy of 99.35% and indicating that the proposed detection system outperforms previous studies.
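As a rough illustration of what static feature extraction from a PDF can look like, the following minimal Python sketch counts a few structural markers in the raw bytes of a file. The keyword list, the feature names, and the `extract_features` helper are assumptions made for this example; this record does not list the thesis's actual 33 features, and the sketch does not decode compressed or obfuscated content.

```python
import re

# Hypothetical keyword list: PDF name objects often inspected in static triage.
# These are NOT the thesis's feature set, which this record does not enumerate.
KEYWORDS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile")


def extract_features(data: bytes) -> dict:
    """Return simple static features computed from raw PDF bytes."""
    features = {"file_size": len(data)}

    # Header version, e.g. "1.7" from "%PDF-1.7"; None if the header is missing.
    header = re.match(rb"%PDF-(\d\.\d)", data)
    features["header_version"] = header.group(1).decode() if header else None

    # Indirect object definitions look like "12 0 obj".
    features["obj_count"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data))

    # Count "stream" keywords that are not part of "endstream".
    features["stream_count"] = len(re.findall(rb"(?<!end)stream\b", data))

    # Count each keyword, requiring that it is not a prefix of a longer name.
    for kw in KEYWORDS:
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        features[kw.decode()] = len(re.findall(pattern, data))
    return features


# Tiny hand-written sample, not a complete or valid PDF.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"%%EOF\n"
)
features = extract_features(sample)
```

A feature dictionary of this kind could then be turned into one row of a table and passed to a classifier such as CatBoost; that training step is omitted here because the record does not describe its configuration.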

Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1    Introduction
1    Motivation
2    Contributions
3    Organization
Chapter 2    Related Work
1    Portable Document Format Structure
1.1    Physical Structure
1.2    Logical Structure
2    Malicious PDF Document Attack Trends
3    Evasion Attacks in Malicious PDF Documents Detection
4    Detection of Malicious PDF Documents
Chapter 3    Proposed System
1    System Architecture
2    Data Collection
2.1    Data Source
3    Feature Definition
3.1    Physical-based Features
3.2    Declare-based Features
3.3    Functional-based Features
4    Data Processing
5    Detection Model Architecture
Chapter 4    Performance Analysis
1    System Environment
1.1    Experiment Environment
1.2    Experiment Parameter
2    Performance Evaluation Metrics
3    Performance Analysis
3.1    Feature Analysis of Physical-based Features
3.2    Feature Analysis of Declare-based Features
3.3    Feature Analysis of Functional-based Features
3.4    Feature Analysis of All Features
4    Comparison of Different Study
5    Summary
Chapter 5    Conclusions and Future Works
1    Conclusions
2    Future Works
References

