
Graduate Student: Yen-Ho Chen (陳彥合)
Thesis Title: ByteBERT: A Pre-trained Language Model for IoT Malware Detection Using Byte Sequences
Advisor: Shin-Ming Cheng (鄭欣明)
Oral Defense Committee: 李漢銘, 游家牧, 王紹睿, Shin-Ming Cheng (鄭欣明)
Degree: Master
Department: College of Industry-Academia Innovation, Graduate Institute of A.I. Cross-disciplinary Technology
Publication Year: 2024
Graduation Academic Year: 112 (2023–2024)
Language: English
Pages: 37
Keywords: IoT Malware, Malware Detection, NLP, Pre-trained Language Model, Deep Learning
Access counts: 785 views; 0 downloads

Pre-trained language models have shown remarkable potential in the field of malware detection. However, methods based on pre-trained language models typically rely on reverse engineering to extract high-level features (e.g., opcodes). This reliance is not only time-consuming but also vulnerable to anti-reverse-engineering techniques and stripped binaries, leading to unreliable detection. To address these challenges while retaining the advantages of pre-trained language models, we propose ByteBERT, a novel byte-based pre-trained language model designed to extract deep semantic features directly from byte sequences without reverse engineering. By modifying the BERT architecture and pre-training with a masked language modeling task, ByteBERT strengthens its contextual understanding of bytes. The pre-trained ByteBERT is then fine-tuned for malware detection and classification. Our experiments show that ByteBERT outperforms existing pre-trained-language-model-based and binary-feature-based methods, improving malware detection and classification performance. Our main contributions are the development of the ByteBERT model for byte-sequence pre-training, its fine-tuning for malware detection and classification, and the experimental confirmation of its effectiveness. This work demonstrates that ByteBERT offers an efficient and reliable solution for IoT malware detection. (Translated from the Chinese abstract.)


Pre-trained language models (PLMs) have demonstrated significant potential in the field of malware detection. However, methods based on these models often depend on reverse engineering to extract high-level features such as opcodes. This dependency is not only time-intensive but also susceptible to anti-reverse-engineering strategies and stripped binary files, leading to unreliable detection. To address these challenges while maintaining the benefits of PLMs, we propose ByteBERT, a novel PLM that extracts deep semantic features directly from raw bytes, eliminating the need for reverse engineering. By adapting the BERT architecture and employing a masked language modeling task during pre-training, ByteBERT improves its contextual understanding of byte sequences. The pre-trained ByteBERT is then fine-tuned for malware detection and classification. Our experiments show that ByteBERT surpasses current PLM-based and binary-based feature methods, enhancing both malware detection and classification performance. Our main contributions include developing the ByteBERT model for byte-sequence pre-training, fine-tuning it for malware detection and classification, and experimentally confirming its efficacy. This research shows that ByteBERT provides an efficient and dependable solution for IoT malware detection.
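The byte-sequence pre-training described in the abstract can be sketched as follows. This is a hypothetical illustration, not the thesis's actual implementation: the vocabulary layout, special-token IDs, and sequence length are assumptions; only the general recipe (each raw byte as one token, BERT-style 15% masking for the MLM objective) follows the approach the abstract names.

```python
import random

# Hypothetical ByteBERT-style input preparation (illustrative sketch only):
# each raw byte is one token (vocabulary of 256 byte values plus special
# tokens), and ~15% of byte positions are masked for the MLM objective,
# following the standard BERT masking recipe.

PAD, CLS, SEP, MASK = 256, 257, 258, 259  # assumed special-token IDs after the 256 byte values

def bytes_to_tokens(data: bytes, max_len: int = 512) -> list[int]:
    """Map raw bytes to token IDs: [CLS] b0 b1 ... [SEP], right-padded to max_len."""
    body = list(data[: max_len - 2])           # each byte value 0-255 is its own token ID
    tokens = [CLS] + body + [SEP]
    tokens += [PAD] * (max_len - len(tokens))  # pad to a fixed sequence length
    return tokens

def mask_for_mlm(tokens: list[int], mask_prob: float = 0.15, seed: int = 0):
    """BERT-style masking: of chosen positions, 80% -> [MASK], 10% -> random byte, 10% kept."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [-100] * len(tokens)  # -100 = position ignored by the loss
    for i, tok in enumerate(tokens):
        if tok >= 256:                         # never mask special tokens or padding
            continue
        if rng.random() < mask_prob:
            labels[i] = tok                    # the model must predict the original byte here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.randrange(256)
    return inputs, labels

# Toy ELF-like byte sequence (the 0x7f 'E' 'L' 'F' magic repeated, for illustration)
tokens = bytes_to_tokens(b"\x7fELF\x01\x01\x01\x00" * 8)
inputs, labels = mask_for_mlm(tokens)
print(sum(l != -100 for l in labels), "positions masked out of", len(tokens))
```

The masked `inputs` and `labels` pair is what an MLM pre-training loop would consume; fine-tuning for detection would then replace the MLM head with a classifier over the `[CLS]` representation.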

Chinese Abstract
Abstract
Acknowledgments
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Challenges and Goals
  1.4 Contributions
  1.5 Outline of the Thesis
2 Related Work
  2.1 Pre-trained Language Model (PLM)
  2.2 PLM-based Methods
  2.3 Binary-based Feature Methods
3 Methodology
  3.1 Pre-train: ByteBERT
    3.1.1 Model Architecture
    3.1.2 Masked Language Modeling
  3.2 Downstream Task Fine-tuning
    3.2.1 Malware Detection
    3.2.2 Malware Family Classification
4 Experimental Setup
  4.1 Datasets
  4.2 Model Architecture Modification
  4.3 Pre-train Phase
  4.4 Fine-tune Phase
  4.5 Methods for Comparison
5 Evaluation
  5.1 RQ 1. Can ByteBERT understand bytes better than the original BERT for IoT malware detection?
  5.2 RQ 2. How does ByteBERT perform in malware detection compared to other PLM-based methods?
  5.3 RQ 3. How does ByteBERT perform in malware detection compared to other binary-based feature methods?
6 Limitations & Future Work
  6.1 Limitations
  6.2 Future Work
7 Conclusions


Full-text release date: 2026/08/13 (campus network)
Full-text release date: 2029/08/13 (off-campus network)
Full-text release date: 2029/08/13 (National Central Library: Taiwan NDLTD system)