| Graduate Student | 陳彥合 Yen-Ho Chen |
|---|---|
| Thesis Title | ByteBERT:透過位元組序列預訓練語言模型之物聯網惡意軟體檢測 (ByteBERT: A Pre-trained Language Model for IoT Malware Detection Using Byte Sequences) |
| Advisor | 鄭欣明 Shin-Ming Cheng |
| Committee Members | 李漢銘, 游家牧, 王紹睿, 鄭欣明 |
| Degree | Master (碩士) |
| Department | College of Industry-Academia Innovation, Graduate Institute of A.I. Cross-disciplinary Technology (產學創新學院 - 人工智慧跨域科技研究所) |
| Year of Publication | 2024 |
| Academic Year | 112 (ROC calendar) |
| Language | English |
| Pages | 37 |
| Keywords | IoT Malware, Malware Detection, NLP, Pre-trained Language Model, Deep Learning |
Pre-trained language models (PLMs) have demonstrated significant potential in malware detection. However, PLM-based methods typically depend on reverse engineering to extract high-level features such as opcodes. This dependency is not only time-intensive but also susceptible to anti-reverse-engineering techniques and stripped binaries, leading to unreliable detection. To address these challenges while retaining the benefits of PLMs, we propose ByteBERT, a novel byte-level PLM that extracts deep semantic features directly from raw byte sequences, eliminating the need for reverse engineering. By adapting the BERT architecture and employing a masked language modeling task during pre-training, ByteBERT improves its contextual understanding of byte sequences. The pre-trained ByteBERT is then fine-tuned for malware detection and classification. Our experiments show that ByteBERT surpasses existing PLM-based and binary-feature-based methods, improving both malware detection and classification performance. Our main contributions are the development of the ByteBERT model for byte-sequence pre-training, its fine-tuning for malware detection and classification, and the experimental confirmation of its efficacy. This research shows that ByteBERT provides an efficient and dependable solution for IoT malware detection.
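The byte-level pipeline the abstract describes (tokenizing raw bytes, then applying BERT-style masked-language-model corruption for pre-training) can be sketched roughly as follows. This is a minimal illustration only: the special-token ids, vocabulary layout, sequence length, and the 80/10/10 corruption split follow common BERT conventions and are assumptions, not details taken from the thesis.

```python
import random

# Hypothetical byte-level vocabulary: each of the 256 byte values is its own
# token, preceded by BERT-style special tokens. All ids are illustrative.
SPECIALS = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3}
BYTE_OFFSET = len(SPECIALS)          # byte value b -> token id b + 4
VOCAB_SIZE = 256 + BYTE_OFFSET

def encode(raw: bytes, max_len: int = 16) -> list[int]:
    """Turn a raw byte sequence into [CLS] <bytes> [SEP], padded to max_len."""
    ids = [SPECIALS["[CLS]"]] + [b + BYTE_OFFSET for b in raw[: max_len - 2]]
    ids.append(SPECIALS["[SEP]"])
    ids += [SPECIALS["[PAD]"]] * (max_len - len(ids))
    return ids

def mask_for_mlm(ids, mask_prob=0.15, seed=0):
    """BERT-style masked-language-model corruption.

    Each byte token is selected with probability mask_prob; of the selected
    tokens, 80% become [MASK], 10% become a random byte token, and 10% stay
    unchanged. `labels` keeps the original id at selected positions and -100
    elsewhere (the conventional "ignore" index for the cross-entropy loss).
    """
    rng = random.Random(seed)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok < BYTE_OFFSET or rng.random() >= mask_prob:
            continue  # special token, or not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = SPECIALS["[MASK]"]
        elif roll < 0.9:
            inputs[i] = rng.randrange(BYTE_OFFSET, VOCAB_SIZE)
    return inputs, labels

# Example: the first bytes of an ELF header as one pre-training sample.
sample = encode(b"\x7fELF\x01\x01\x01\x00", max_len=12)
corrupted, labels = mask_for_mlm(sample)
```

During pre-training, a BERT-style encoder would be trained to recover `labels` from `corrupted`; for the downstream detection and classification tasks, the same encoder would instead feed the [CLS] representation into a classification head.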
[1] A. D. Raju, I. Y. Abualhaol, R. S. Giagone, Y. Zhou, and S. Huang, “A survey on cross-architectural IoT malware threat hunting,” IEEE Access, vol. 9, pp. 91686–91709, Jun. 2021.
[2] A. D. Jurcut, P. Ranaweera, and L. Xu, “Introduction to IoT security,” IoT Security: Advances in Authentication, pp. 27–64, 2020.
[3] S. Talukder and Z. Talukder, “A survey on malware detection and analysis tools,” International Journal of Network Security & Its Applications (IJNSA), vol. 12, 2020.
[4] Q.-D. Ngo, H.-T. Nguyen, V.-H. Le, and D.-H. Nguyen, “A survey of IoT malware and detection methods based on static features,” ICT Express, vol. 6, no. 4, pp. 280–286, 2020.
[5] F. Shahzad and M. Farooq, “ELF-Miner: Using structural knowledge and data mining methods to detect new (Linux) malicious executables,” Knowledge and Information Systems, vol. 30, pp. 589–612, 2012.
[6] D. Vasan, M. Alazab, S. Wassan, H. Naeem, B. Safaei, and Q. Zheng, “IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture,” Computer Networks, vol. 171, p. 107138, 2020.
[7] A. Ravi, V. Chaturvedi, and M. Shafique, “ViT4Mal: Lightweight vision transformer for malware detection on edge devices,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–26, 2023.
[8] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, and C. K. Nicholas, “Malware detection by eating a whole EXE,” in Proc. AAAI 2018, 2018.
[9] R. Chaganti, V. Ravi, and T. D. Pham, “Deep learning based cross architecture internet of things malware detection and classification,” Computers & Security, vol. 120, p. 102779, 2022.
[10] T.-L. Wan, T. Ban, S.-M. Cheng, Y.-T. Lee, B. Sun, R. Isawa, T. Takahashi, and D. Inoue, “Efficient detection and classification of internet-of-things malware based on byte sequences from executable files,” IEEE Open Journal of the Computer Society, vol. 1, pp. 262–275, 2020.
[11] S. A. Hamad, Q. Z. Sheng, and W. E. Zhang, “BERTDeep-Ware: A cross-architecture malware detection solution for IoT systems,” in Proc. IEEE TrustCom 2021, 2021, pp. 927–934.
[12] B. Wu, Y. Xu, and F. Zou, “Malware classification by learning semantic and structural features of control flow graphs,” in Proc. IEEE TrustCom 2021, 2021, pp. 540–547.
[13] C. Li, G. Shen, and W. Sun, “Cross-architecture internet-of-things malware detection based on graph neural network,” in Proc. IJCNN 2021, 2021, pp. 1–7.
[14] A. S. Kale, V. Pandya, F. D. Troia, and M. Stamp, “Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo,” Journal of Computer Virology and Hacking Techniques, vol. 19, no. 1, pp. 1–16, 2023.
[15] P. Kunwar, K. Aryal, M. Gupta, M. Abdelsalam, and E. Bertino, “SoK: Leveraging transformers for malware analysis,” arXiv preprint arXiv:2405.17190, 2024.
[16] Z. Liu, “A review of advancements and applications of pre-trained language models in cybersecurity,” in Proc. IEEE ISDFS 2024, 2024, pp. 1–10.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[18] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, Jan. 2020.
[20] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[23] “Executable and linking format (ELF) specification version 1.2,” Tool Interface Standard (TIS), May 1995. [Online]. Available: https://refspecs.linuxbase.org/elf/elf.pdf
[24] S. Choi, T. Chang, S.-W. Yoon, and Y. Park, “Hybrid emulation for bypassing anti-reversing techniques and analyzing malware,” The Journal of Supercomputing, vol. 77, no. 1, pp. 471–497, 2021.
[25] B. Singh and H. Joseph, Vulnerability Analysis and Defense for the Internet. Springer Science & Business Media, 2008, vol. 37.
[26] X. Jin, K. Pei, J. Y. Won, and Z. Lin, “SymLM: Predicting function names in stripped binaries via context-sensitive execution-aware code embeddings,” in Proc. of the ACM CCS 2022, 2022, pp. 1631–1645.
[27] J. He, P. Ivanov, P. Tsankov, V. Raychev, and M. Vechev, “Debin: Predicting debug information in stripped binaries,” in Proc. of the ACM CCS 2018, 2018, pp. 1667–1680.
[28] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “CodeBERT: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
[29] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859, 2021.
[30] J. Xiong, G. Chen, K. Chen, H. Gao, S. Cheng, and W. Zhang, “HexT5: Unified pre-training for stripped binary code information inference,” in Proc. IEEE/ACM ASE 2023, 2023, pp. 774–786.
[31] A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in Proc. IEEE SANER 2023, 2023, pp. 260–271.
[32] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code Llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
[33] Y. Xiao, S. Ahmed, X. Ge, B. Viswanath, N. Meng, and D. Yao, “Comprehensive comparisons of embedding approaches for cryptographic API completion,” in Proc. ACM/IEEE ICSE 2022, 2022, pp. 360–361.
[34] X. Jin, J. Larson, W. Yang, and Z. Lin, “Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models,” arXiv preprint arXiv:2312.09601, 2023.
[35] “ChatGPT,” https://openai.com/chatgpt/.
[36] Z. Ding, H. Xu, Y. Guo, L. Yan, L. Cui, and Z. Hao, “Mal-Bert-GCN: Malware detection by combining BERT and GCN,” in Proc. IEEE TrustCom 2022, 2022, pp. 175–183.
[37] A. Rahali and M. A. Akhloufi, “MalBERT: Malware detection using bidirectional encoder representations from transformers,” in Proc. IEEE SMC 2021, 2021, pp. 3226–3231.
[38] D. Demirci and C. Acarturk, “Static malware detection using stacked BiLSTM and GPT-2,” IEEE Access, vol. 10, pp. 58488–58502, 2022.
[39] R. Jones, M. Omar, D. Mohammed, C. Nobels, and M. Dawson, “IoT malware detection with GPT models,” in Proc. IEEE CSCE 2023, 2023, pp. 1749–1752.
[40] “PyTorch,” https://pytorch.org.
[41] “Huggingface,” https://huggingface.co.
[42] “VirusShare,” https://virusshare.com/.
[43] “VirusTotal,” https://www.virustotal.com.
[44] “AVClass,” https://github.com/malicialab/avclass.