
Graduate student: 林志達
Chih-ta Lin
Thesis title: 惡意程式高維動態行為特徵選取與降維分析
An Efficient Feature Selection and Extraction Analysis for Malware Behavior Classification
Advisor: 王乃堅
Nai-Jian Wang
Oral defense committee: 黃彥男
Yen-nun HUANG
李漢銘
Hahn-Ming Lee
陳孟彰
Meng-Chang Chen
呂學坤
Shyue-Kung Lu
陳俊良
Jiann-Liang Chen
洪士灝
Shih-Hao Hung
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2015
Academic year of graduation: 103
Language: English
Number of pages: 114
Keywords (Chinese): 惡意程式動態分析、分類學習、數據降維、特徵選擇、特徵萃取
Keywords (English): Dynamic Malware Analysis, Data Classification, Dimensionality Reduction, Feature Selection, Feature Extraction
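    The volume of malware grows at a striking rate every year and continues to pose new kinds of threats to networks and operating systems. Traditional detection, which matches known malware against static file signatures, can no longer cope with the many variants and new concealment techniques that attackers produce. Although such malware easily evades antivirus scanners, its dynamic runtime behavior still reveals its intent, so malicious behavior can be detected effectively and machine learning can be used for prediction to raise the detection rate. However, the diversity of malware and the large numbers of behavioral features and samples make classification training and analysis resource-intensive and time-consuming.
    This thesis proposes a generic and efficient method for analyzing and predicting the behavior of each type of malware. The method combines feature selection and feature extraction: effective features are selected first and the feature dimensionality is then reduced, so that this two-stage approach greatly lowers the dimensionality of the feature space before a classification model is built. After recording the system, network, and registry behavior of each program under test in a sandbox environment, the study performs feature reduction and classification in five steps: (1) extracting n-gram feature text data of function calls from the log files, (2) building a malware classifier with an SVM, (3) selecting effective feature sets with TF-IDF, (4) transforming and reducing features with methods such as PCA and KPCA, and (5) combining the above steps to build a model for fast learning and prediction.
    In addition, the thesis proposes grouped feature selection and transformation methods to improve the classification rate and, during feature analysis, shows a simple and effective way to identify the main behavior patterns of each malware type. Experiments were run on 4,288 collected programs covering eight malware classes and one benign class. They show that two-stage feature reduction greatly shortens model training and analysis time: a classifier that combines grouped TF-IDF, PCA, and SVM can be retrained and make predictions within tens of seconds on high-dimensional data with hundreds of thousands of features. The results confirm that the method is competitive in both effectiveness and efficiency.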
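    The n-gram extraction and TF-IDF selection steps described above can be illustrated with a minimal Python sketch. It is not the thesis's MATLAB code: the API call names, the helper names (`ngrams`, `tfidf_scores`, `select_features`, `vectorize`), and the particular TF-IDF weighting (relative term frequency times the logarithm of inverse document frequency, keeping each n-gram's best score over all samples) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def ngrams(calls, n):
    """Return the n-grams (as tuples) over a sequence of call names."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


def tfidf_scores(docs):
    """Map each n-gram to its highest TF-IDF value across all samples.

    docs is a list of n-gram lists, one list per sample.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each n-gram once per sample

    best = {}
    for doc in docs:
        if not doc:
            continue
        for gram, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[gram])
            best[gram] = max(best.get(gram, 0.0), tf * idf)
    return best


def select_features(docs, k):
    """Keep the k n-grams with the highest scores (ties broken by name)."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def vectorize(doc, features):
    """Turn one sample into a count vector over the selected n-grams."""
    counts = Counter(doc)
    return [counts[f] for f in features]


# Hypothetical call traces; real logs would come from the sandbox.
logs = [
    ["CreateFile", "WriteFile", "CloseHandle", "RegSetValue"],
    ["CreateFile", "WriteFile", "CloseHandle", "Connect"],
    ["CreateFile", "ReadFile", "CloseHandle"],
]
docs = [ngrams(trace, 2) for trace in logs]
features = select_features(docs, 2)
vectors = [vectorize(doc, features) for doc in docs]
```

    The resulting count vectors are what a classifier such as an SVM would be trained on. An n-gram that occurs in every sample receives a score of zero under this weighting, so it is the first to be dropped.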


    The explosive growth of malware continues to threaten networks and operating systems. Signature-based methods are widely used to detect malware, but they cannot identify malware variants on the fly. Behavior-based methods, on the other hand, can characterize malware behavior effectively. However, training and prediction for each specific malware family are time-consuming.
    We propose a generic and efficient algorithm to classify malware. Our method combines feature selection and feature extraction, which significantly reduces the dimensionality of the features used for training and classification. Based on malware behaviors collected from a sandbox environment, our method proceeds in five steps: (a) extracting n-gram feature space data from behavior logs, (b) building a support vector machine (SVM) classifier for malware classification, (c) selecting a subset of features, (d) transforming high-dimensional feature vectors into low-dimensional feature vectors, and (e) selecting models.
    Furthermore, we propose a Multi-Grouping (MG) algorithm for each feature reduction method. During the feature selection and extraction process, we show an easy way to identify the major behaviors of each malware type. Experiments were conducted on a real-world data set with 4,288 samples from 9 families. As a proof of concept, we evaluated our method in an online training simulation experiment. Our two-stage dimensionality reduction approach reduced the time cost significantly. The combination of MG TF-IDF, PCA, and SVM can finish retraining and classification in seconds, which is sufficient to meet the online learning requirement of collecting malware behavior every minute. The experiments demonstrate the effectiveness and efficiency of our approach.
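    Step (d), the transformation to a low-dimensional space, can be illustrated with PCA, which the abstract names as part of the final combination. The sketch below is only an illustration under stated assumptions: it approximates the first principal component by power iteration in plain Python, the function names and toy data are hypothetical, and a practical pipeline would more likely call a linear algebra library's eigendecomposition or SVD and keep several components.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def first_component(rows, iterations=200):
    """Approximate the first principal component by power iteration.

    The start vector (1, 2, 3, ...) is an arbitrary choice; if it happens
    to be orthogonal to the true component the iteration will not find it.
    """
    x = center(rows)
    dims = len(x[0])
    start = [j + 1.0 for j in range(dims)]
    norm = math.sqrt(sum(c * c for c in start))
    v = [c / norm for c in start]
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix.
        proj = [sum(a * b for a, b in zip(row, v)) for row in x]
        w = [sum(proj[i] * x[i][j] for i in range(len(x)))
             for j in range(dims)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [c / norm for c in w]
    return v


def project(rows, component):
    """Project centered rows onto one component (one value per row)."""
    return [sum(a * b for a, b in zip(row, component))
            for row in center(rows)]


# Toy feature vectors: only the first column varies between samples.
samples = [[2, 0, 1], [4, 0, 1], [6, 0, 1], [8, 0, 1]]
component = first_component(samples)
reduced = project(samples, component)
```

    Each three-dimensional sample is reduced to a single coordinate along the direction of greatest variance, which is the same idea applied at much larger scale when hundreds of thousands of n-gram features are compressed before training.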

    Title Page
    Chinese Abstract
    English Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Terms and Symbols
    1. INTRODUCTION
    2. RELATED WORK
    3. METHODOLOGY
    3.1 Behavior Monitoring and Data Preprocessing
    3.2 Training and Testing
    3.3 Feature Selection Analysis
    3.4 Feature Extraction Analysis
    3.4.1 Principal Component Analysis
    3.4.2 Kernel Principal Component Analysis
    3.5 Model Selection and Online Extension
    4. EXPERIMENT
    4.1 Behavior Monitoring and Data Preprocessing
    4.2 Feature Selection Analysis
    4.3 Feature Extraction Analysis
    4.4 Model Selection and Online Extension
    5. CONCLUSIONS
    References
    Appendix A: PCA analysis result in the bigram to four-gram test
    Appendix B: The first principal component of MG PCA analysis result in the bigram to four-gram test
    Appendix C: MATLAB / Machine Learning program listing
    Appendix D: MATLAB / Feature extraction program listing
    Appendix E: MATLAB / Online Simulation program listing

