
Graduate Student: 李政益 (Zheng-Yi Lee)
Thesis Title: Anomaly Detection via Over-sampling Principal Component Analysis
Advisor: 李育杰 (Yuh-Jye Lee)
Committee Members: 楊傳凱 (Chuan-Kai Yang), 鮑興國 (Hsing-Kuo Kenneth Pao), 項天瑞 (Tien-Ruey Hsiang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2009
Graduation Academic Year: 97
Language: English
Number of Pages: 56
Keywords: data mining, outlier detection, principal component analysis, over-sampling, Leave One Out


    Outlier detection is an important issue in data mining and has been studied in many different research areas. It can be used to detect a small amount of deviated data. In this thesis, we use the "Leave One Out" procedure to check each individual point for its "with or without" effect on the variation of the principal directions. Based on this idea, we propose an over-sampling principal component analysis outlier detection method that amplifies the influence of an abnormal instance (an outlier). Besides identifying suspicious outliers, we also design an on-line anomaly detection algorithm to detect newly arriving anomalies. In addition, we study a fast update formula for the principal directions so that the computation is efficient enough for on-line detection. Numerical experiments show that the proposed method performs well in both computation time and detection quality.
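
    The abstract's central idea — duplicating ("over-sampling") each point in turn, in a Leave One Out pass, and measuring how much the leading principal direction rotates — can be sketched as follows. This is a minimal illustration, not the thesis's exact algorithm: the duplication `ratio` parameter, the use of only the first principal direction, and the cosine-based score are assumptions made for the example.

    ```python
    import numpy as np

    def leading_direction(X):
        """Leading principal direction of the data (rows = instances)."""
        Xc = X - X.mean(axis=0)                      # mean-center
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[0]                                 # top right-singular vector

    def oversampling_pca_scores(X, ratio=0.1):
        """Score each point by how much over-sampling it (appending
        duplicated copies) rotates the leading principal direction."""
        n = X.shape[0]
        u = leading_direction(X)
        k = max(1, int(ratio * n))                   # copies appended per point
        scores = np.empty(n)
        for i in range(n):
            X_over = np.vstack([X, np.repeat(X[i:i + 1], k, axis=0)])
            u_i = leading_direction(X_over)
            scores[i] = 1.0 - abs(u @ u_i)           # 1 - |cosine similarity|
        return scores

    # Toy data: a cluster elongated along the x-axis plus one off-axis outlier.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) * np.array([5.0, 1.0])
    X = np.vstack([X, [[0.0, 30.0]]])                # index 100 is the outlier
    scores = oversampling_pca_scores(X)
    print(int(np.argmax(scores)))                    # index of the top-scoring point
    ```

    A normal point duplicated a few times barely moves the principal direction, so its score stays near zero; duplicating the outlier pulls the direction toward it, producing a large score.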

    1 Introduction
      1.1 Background
      1.2 Our Main Work
      1.3 Organization of Thesis
    2 Framework of Unsupervised Anomaly Detection
      2.1 Misuse Detection vs. Anomaly Detection
      2.2 Data Cleaning and On-line Detection Phase
    3 Related Work
      3.1 Distribution-based
      3.2 Depth-based
      3.3 Support-vector-machines-based
      3.4 Distance-based
      3.5 Density-based
      3.6 Cluster-based
      3.7 Angle-based
      3.8 Spectrum-based
        3.8.1 Mahalanobis-distance-based
        3.8.2 Reconstruction-error-based
        3.8.3 Cosine-similarity-based
    4 Over-sampling Principal Component Analysis
      4.1 Principal Component Analysis
        4.1.1 Treat PCA as a Variance Maximization Problem
        4.1.2 Treat PCA as a Square Error Minimization Problem
      4.2 The Influence of An Outlier on Principal Directions
      4.3 Over-sampling Principal Component Analysis
      4.4 Data Cleaning and On-line Anomaly Detection
      4.5 Speedup for the Power Method
    5 Experimental Results
      5.1 Performance Measures
      5.2 Experimental Results
        5.2.1 Synthetic Dataset
        5.2.2 UCI Pendigits Dataset
        5.2.3 KDD Cup 99 Dataset
    6 Discussion, Conclusion and Future Work
      6.1 Discussion
      6.2 Conclusion and Future Work
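
    The fast on-line computation mentioned in the abstract rests on extracting the leading principal direction cheaply, and the outline's Section 4.5 concerns a speedup for the power method. Below is a plain, unaccelerated power iteration on a covariance matrix as a baseline sketch; the tolerance, iteration cap, and toy data are illustrative choices, not values from the thesis.

    ```python
    import numpy as np

    def power_method(A, tol=1e-10, max_iter=1000):
        """Dominant eigenvector of a symmetric PSD matrix A by power iteration."""
        rng = np.random.default_rng(1)
        v = rng.normal(size=A.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(max_iter):
            w = A @ v
            w /= np.linalg.norm(w)                   # re-normalize each step
            # stop once the direction (up to sign) no longer changes
            if min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < tol:
                return w
            v = w
        return v

    # The covariance matrix of mean-centered data; its dominant eigenvector
    # is the leading principal direction.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) * np.array([4.0, 1.0, 0.5])
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(X)
    u = power_method(C)
    print(int(np.abs(u).argmax()))                   # axis carrying the largest variance
    ```

    Each iteration costs one matrix-vector product, which is why the power method suits repeated on-line updates of the principal direction better than a full eigendecomposition.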

