| Graduate Student | 陳彥合 Yen-Ho Chen |
|---|---|
| Thesis Title | ByteBERT:透過位元組序列預訓練語言模型之物聯網惡意軟體檢測 (ByteBERT: A Pre-trained Language Model for IoT Malware Detection Using Byte Sequences) |
| Advisor | 鄭欣明 Shin-Ming Cheng |
| Committee Members | 李漢銘, 游家牧, 王紹睿, 鄭欣明 |
| Degree | Master (碩士) |
| Department | College of Industry-Academia Innovation, Graduate Institute of A.I. Cross-disciplinary Technology (產學創新學院 - 人工智慧跨域科技研究所) |
| Year of Publication | 2024 |
| Academic Year | 112 (ROC calendar) |
| Language | English |
| Pages | 37 |
| Keywords | IoT Malware, Malware Detection, NLP, Pre-trained Language Model, Deep Learning |
Pre-trained language models (PLMs) have demonstrated significant potential in malware detection. However, PLM-based methods typically depend on reverse engineering to extract high-level features such as opcodes. This dependency is not only time-intensive but also susceptible to anti-reverse-engineering techniques and stripped binaries, leading to unreliable detection. To address these challenges while retaining the benefits of PLMs, we propose ByteBERT, a novel byte-level PLM that extracts deep semantic features directly from raw byte sequences, eliminating the need for reverse engineering. By adapting the BERT architecture and employing a masked language modeling task during pre-training, ByteBERT improves its contextual understanding of byte sequences. The pre-trained ByteBERT is then fine-tuned for malware detection and classification. Our experiments show that ByteBERT surpasses existing PLM-based and binary-feature-based methods, improving both malware detection and classification performance. Our main contributions are the development of the ByteBERT model for byte-sequence pre-training, its fine-tuning for malware detection and classification, and the experimental confirmation of its efficacy. This research shows that ByteBERT provides an efficient and dependable solution for IoT malware detection.
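The byte-level pipeline the abstract describes (tokenizing raw bytes, then applying BERT-style masked-language-model corruption for pre-training) can be sketched roughly as follows. This is a minimal illustration only: the special-token ids, vocabulary layout, sequence length, and the 80/10/10 corruption split follow common BERT conventions and are assumptions, not details taken from the thesis.

```python
import random

# Hypothetical byte-level vocabulary: each of the 256 byte values is its own
# token, preceded by BERT-style special tokens. All ids are illustrative.
SPECIALS = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3}
BYTE_OFFSET = len(SPECIALS)          # byte value b -> token id b + 4
VOCAB_SIZE = 256 + BYTE_OFFSET

def encode(raw: bytes, max_len: int = 16) -> list[int]:
    """Turn a raw byte sequence into [CLS] <bytes> [SEP], padded to max_len."""
    ids = [SPECIALS["[CLS]"]] + [b + BYTE_OFFSET for b in raw[: max_len - 2]]
    ids.append(SPECIALS["[SEP]"])
    ids += [SPECIALS["[PAD]"]] * (max_len - len(ids))
    return ids

def mask_for_mlm(ids, mask_prob=0.15, seed=0):
    """BERT-style masked-language-model corruption.

    Each byte token is selected with probability mask_prob; of the selected
    tokens, 80% become [MASK], 10% become a random byte token, and 10% stay
    unchanged. `labels` keeps the original id at selected positions and -100
    elsewhere (the conventional "ignore" index for the cross-entropy loss).
    """
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok < BYTE_OFFSET or rng.random() >= mask_prob:
            continue  # special token, or not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = SPECIALS["[MASK]"]
        elif roll < 0.9:
            inputs[i] = rng.randrange(BYTE_OFFSET, VOCAB_SIZE)
    return inputs, labels

# Example: the first bytes of an ELF header as one pre-training sample.
sample = encode(b"\x7fELF\x01\x01\x01\x00", max_len=12)
corrupted, labels = mask_for_mlm(sample)
```

During pre-training, a BERT-style encoder would be trained to recover `labels` from `corrupted`; for the downstream detection and classification tasks, the same encoder would instead feed the [CLS] representation into a classification head.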
[1] A. D. Raju, I. Y. Abualhaol, R. S. Giagone, Y. Zhou, and S. Huang, “A survey on cross-architectural IoT malware threat hunting,” IEEE Access, vol. 9, pp. 91686–91709, Jun. 2021.
[2] A. D. Jurcut, P. Ranaweera, and L. Xu, “Introduction to IoT security,” IoT Security: Advances in Authentication, pp. 27–64, 2020.
[3] S. Talukder and Z. Talukder, “A survey on malware detection and analysis tools,” International Journal of Network Security & Its Applications (IJNSA), vol. 12, 2020.
[4] Q.-D. Ngo, H.-T. Nguyen, V.-H. Le, and D.-H. Nguyen, “A survey of IoT malware and detection methods based on static features,” ICT Express, vol. 6, no. 4, pp. 280–286, 2020.
[5] F. Shahzad and M. Farooq, “ELF-Miner: Using structural knowledge and data mining methods to detect new (Linux) malicious executables,” Knowledge and Information Systems, vol. 30, pp. 589–612, 2012.
[6] D. Vasan, M. Alazab, S. Wassan, H. Naeem, B. Safaei, and Q. Zheng, “IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture,” Computer Networks, vol. 171, p. 107138, 2020.
[7] A. Ravi, V. Chaturvedi, and M. Shafique, “ViT4Mal: Lightweight vision transformer for malware detection on edge devices,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–26, 2023.
[8] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, and C. K. Nicholas, “Malware detection by eating a whole EXE,” in Proc. AAAI 2018, 2018.
[9] R. Chaganti, V. Ravi, and T. D. Pham, “Deep learning based cross architecture internet of things malware detection and classification,” Computers & Security, vol. 120, p. 102779, 2022.
[10] T.-L. Wan, T. Ban, S.-M. Cheng, Y.-T. Lee, B. Sun, R. Isawa, T. Takahashi, and D. Inoue, “Efficient detection and classification of internet-of-things malware based on byte sequences from executable files,” IEEE Open Journal of the Computer Society, vol. 1, pp. 262–275, 2020.
[11] S. A. Hamad, Q. Z. Sheng, and W. E. Zhang, “BERTDeep-Ware: A cross-architecture malware detection solution for IoT systems,” in Proc. IEEE TrustCom 2021, 2021, pp. 927–934.
[12] B. Wu, Y. Xu, and F. Zou, “Malware classification by learning semantic and structural features of control flow graphs,” in Proc. IEEE TrustCom 2021, 2021, pp. 540–547.
[13] C. Li, G. Shen, and W. Sun, “Cross-architecture internet-of-things malware detection based on graph neural network,” in Proc. IJCNN 2021, 2021, pp. 1–7.
[14] A. S. Kale, V. Pandya, F. D. Troia, and M. Stamp, “Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo,” Journal of Computer Virology and Hacking Techniques, vol. 19, no. 1, pp. 1–16, 2023.
[15] P. Kunwar, K. Aryal, M. Gupta, M. Abdelsalam, and E. Bertino, “SoK: Leveraging transformers for malware analysis,” arXiv preprint arXiv:2405.17190, 2024.
[16] Z. Liu, “A review of advancements and applications of pre-trained language models in cybersecurity,” in Proc. IEEE ISDFS 2024, 2024, pp. 1–10.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[18] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, Jan. 2020.
[20] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[23] “Executable and linking format (ELF) specification version 1.2,” Tool Interface Standard (TIS), May 1995. [Online]. Available: https://refspecs.linuxbase.org/elf/elf.pdf
[24] S. Choi, T. Chang, S.-W. Yoon, and Y. Park, “Hybrid emulation for bypassing anti-reversing techniques and analyzing malware,” The Journal of Supercomputing, vol. 77, no. 1, pp. 471–497, 2021.
[25] B. Singh and H. Joseph, Vulnerability Analysis and Defense for the Internet. Springer Science & Business Media, 2008, vol. 37.
[26] X. Jin, K. Pei, J. Y. Won, and Z. Lin, “SymLM: Predicting function names in stripped binaries via context-sensitive execution-aware code embeddings,” in Proc. of the ACM CCS 2022, 2022, pp. 1631–1645.
[27] J. He, P. Ivanov, P. Tsankov, V. Raychev, and M. Vechev, “Debin: Predicting debug information in stripped binaries,” in Proc. of the ACM CCS 2018, 2018, pp. 1667–1680.
[28] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “CodeBERT: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
[29] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859, 2021.
[30] J. Xiong, G. Chen, K. Chen, H. Gao, S. Cheng, and W. Zhang, “HexT5: Unified pre-training for stripped binary code information inference,” in Proc. IEEE/ACM ASE 2023, 2023, pp. 774–786.
[31] A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in Proc. IEEE SANER 2023, 2023, pp. 260–271.
[32] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code Llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
[33] Y. Xiao, S. Ahmed, X. Ge, B. Viswanath, N. Meng, and D. Yao, “Comprehensive comparisons of embedding approaches for cryptographic API completion,” in Proc. ACM/IEEE ICSE 2022, 2022, pp. 360–361.
[34] X. Jin, J. Larson, W. Yang, and Z. Lin, “Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models,” arXiv preprint arXiv:2312.09601, 2023.
[35] “ChatGPT,” https://openai.com/chatgpt/.
[36] Z. Ding, H. Xu, Y. Guo, L. Yan, L. Cui, and Z. Hao, “Mal-Bert-GCN: Malware detection by combining BERT and GCN,” in Proc. IEEE TrustCom 2022, 2022, pp. 175–183.
[37] A. Rahali and M. A. Akhloufi, “MalBERT: Malware detection using bidirectional encoder representations from transformers,” in Proc. IEEE SMC 2021, 2021, pp. 3226–3231.
[38] D. Demirci and C. Acarturk, “Static malware detection using stacked BiLSTM and GPT-2,” IEEE Access, vol. 10, pp. 58488–58502, 2022.
[39] R. Jones, M. Omar, D. Mohammed, C. Nobels, and M. Dawson, “IoT malware detection with GPT models,” in Proc. IEEE CSCE 2023, 2023, pp. 1749–1752.
[40] “PyTorch,” https://pytorch.org.
[41] “Huggingface,” https://huggingface.co.
[42] “VirusShare,” https://virusshare.com/.
[43] “VirusTotal,” https://www.virustotal.com.
[44] “AVClass,” https://github.com/malicialab/avclass.