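Malware Family Classification and Similarity Based on Graph Neural Networks | National Taiwan University of Science and Technology Theses and Dissertations System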

Author:	Ren-Feng Deng (鄧仁豐)
Thesis title:	Malware Family Classification and Similarity Based on Graph Neural Networks (基於圖神經網路之惡意程式分類與相似性分析)
Advisor:	Jiann-Liang Chen (陳俊良)
Committee members:	Yau-Hwang Ku (郭耀煌), Yea-li Sun (孫雅麗), Wan-jiun Liao (廖婉君), Bih-Hwang Lee (黎碧煌)
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication:	2021
Graduation academic year:	109 (ROC calendar, 2020-2021)
Language:	English
Pages:	74
Keywords:	Malware Families, Graph Neural Networks, Siamese Network, Deep Learning, Representation Learning

In this era of globalization, information and communication technologies are developing rapidly, and computer networks have become part of everyday life. The United Nations has even passed a resolution recognizing Internet access as a fundamental human right and a basic need of modern life, and has called on countries to address information security issues and ensure freedom of Internet use. However, the COVID-19 pandemic has forced organizations worldwide to adopt remote work. This has increased the information security risk that most businesses face, and malware is a major threat that can cause serious harm to businesses and individuals alike.
Large security companies receive many malware samples every day, and the continual mutation of malware burdens analysts, so identifying variants of known malware is an important task. This study therefore proposes a malware family identification model based on a graph neural network. Each malware sample is analyzed to obtain its function call relationships and the assembly content of each function, from which a graph representing the sample's function structure is built. The functions are the nodes of the graph, and the calling relationships between functions are its edges. In addition, a representation learning model learns the latent semantics of the assembly code, and the resulting function embedding vector serves as the feature of each node.
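The graph construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the `calls` map, the two-dimensional embeddings, and the helper `build_call_graph` are all hypothetical. A real pipeline would obtain the call relationships from a disassembler and the vectors from a trained representation learning model.

```python
from typing import Dict, List, Tuple


def build_call_graph(
    calls: Dict[str, List[str]],
    embeddings: Dict[str, List[float]],
) -> Tuple[List[str], List[Tuple[int, int]], List[List[float]]]:
    """Turn a function-call map into (node names, directed edges, node features)."""
    # Every function that appears as a caller or a callee becomes a node.
    names = sorted(set(calls) | {c for callees in calls.values() for c in callees})
    index = {name: i for i, name in enumerate(names)}
    # One directed edge per distinct caller -> callee pair.
    edges = sorted(
        {(index[src], index[dst]) for src, callees in calls.items() for dst in callees}
    )
    # Node feature = embedding of the function; zero vector if none is available.
    dim = len(next(iter(embeddings.values())))
    features = [embeddings.get(name, [0.0] * dim) for name in names]
    return names, edges, features


# Hypothetical sample: four functions and made-up 2-dimensional embeddings.
calls = {"start": ["decrypt", "connect"], "decrypt": ["xor_loop"], "connect": []}
embeddings = {
    "start": [0.1, 0.2],
    "decrypt": [0.9, 0.1],
    "connect": [0.3, 0.7],
    "xor_loop": [0.8, 0.0],
}
names, edges, features = build_call_graph(calls, embeddings)
print(names)  # node order used by the edge indices
print(edges)  # directed (caller, callee) index pairs
```

The edge list of index pairs and the per-node feature rows are the kind of input a graph neural network library typically expects, which is why the sketch stops at that representation.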
In addition to a multi-class classification model that predicts a fixed set of classes, this study implements a similarity model based on distance metric learning. The classification model is limited when new classes appear, because it must be retrained on the entire dataset. The similarity model instead measures the distance between two samples in the vector space to assess whether they belong to the same class, and it gradually adjusts the distances between samples during training. Therefore, when the model needs to be extended, the similarity model offers better flexibility and performance.
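The distance-based idea can be illustrated with a contrastive loss, one common training objective for Siamese networks. The abstract does not state which loss the thesis uses, so the loss form, the `margin` and `threshold` values, and the helper names below are assumptions chosen for illustration only.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    a: Sequence[float], b: Sequence[float], is_same: bool, margin: float
) -> float:
    """Pull same-family pairs together; push other pairs at least `margin` apart."""
    d = euclidean(a, b)
    if is_same:
        return d ** 2                    # penalize any distance between like samples
    return max(0.0, margin - d) ** 2     # penalize unlike samples closer than margin


def predict_same_family(
    a: Sequence[float], b: Sequence[float], threshold: float
) -> bool:
    """Decide family membership from distance alone (threshold is an assumption)."""
    return euclidean(a, b) < threshold


# Hypothetical embeddings of two samples, as produced by the shared encoder.
emb_a = [0.0, 0.0]
emb_b = [3.0, 4.0]                       # distance to emb_a is 5.0
loss_if_same = contrastive_loss(emb_a, emb_b, is_same=True, margin=6.0)
loss_if_diff = contrastive_loss(emb_a, emb_b, is_same=False, margin=6.0)
print(loss_if_same, loss_if_diff)
print(predict_same_family(emb_a, emb_b, threshold=5.5))
```

Because the decision depends only on distances between embeddings, a sample from a new family can be compared against stored examples without retraining a fixed output layer, which is the flexibility the paragraph above refers to.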
Finally, the performance of the similarity model is compared with that of previous studies, and its output is visualized for analysis. The similarity model achieved accuracies of 92% on the testing dataset and 70.4% on an unseen dataset. These results indicate that the proposed method outperforms previous studies.

Abstract (in Chinese)
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1 Motivation
2 Contributions
3 Organization
Chapter 2 Related Work
1 Malware Concept
2 Malware Analysis
2.1 Static Analysis
2.2 Dynamic Analysis
3 Malware Detection Techniques
3.1 Signature Based
3.2 Heuristic Based
3.3 Machine/Deep Learning
Chapter 3 Proposed Methods
1 Methods Overview
2 Data Collection
3 Dataset Construction
3.1 Disassemble Malware Samples
3.2 Function Embedding
3.3 Data Preprocessing
4 Classification Model and Prediction
4.1 Graph Neural Networks
4.2 Classification Model Architecture
5 Similarity Model and Measure
5.1 Siamese Network
5.2 Similarity Model Architecture
Chapter 4 Performance Analysis
1 Experimental Environment
2 Experimental Performance
2.1 Classification Model
2.2 Similarity Model
3 Performance Comparison
Chapter 5 Conclusions and Future Works
1 Conclusions
2 Future Works
References

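Full-text release date: 2023/08/02 (campus network)
Full-text release date: 2024/08/02 (off-campus network)
Full-text release date: 2025/08/02 (National Central Library: Taiwan theses and dissertations system)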
