
Author: Alvin Chin-Yen Chiang (蔣勤彥)
Thesis title: A Study on Anomaly Detection Ensembles (融合式異常偵測之探究)
Advisor: Yuh-Jye Lee (李育杰)
Committee members: Hsing-Kuo Pao (鮑興國), Wei-Chung Teng (鄧惟中), Yi-Ren Yeh (葉倚任)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2015
Graduating academic year: 104 (ROC calendar)
Language: English
Number of pages: 60
Chinese keywords: anomaly detection; data mining; decision fusion
Foreign-language keywords: ensemble
Views: 281; Downloads: 5



An anomaly, or outlier, is something that differs from the rest. These differences may correspond to an object or event whose detection is of great practical importance. Fraud, spam, and device malfunctions, for example, are events that need to be noticed, and we characterize them by their deviation from normality. By automating the creation of a ranking of the most deviant items, we can save time and reduce the cognitive load on the individuals or groups responsible for responding to such events.

Over the years, many anomaly and outlier metrics and detection methods have been developed for finding data incongruities. In this thesis we review the general strategies and measures used to characterize the "strangeness" of data, as well as how these separate methods may be combined. Under the assumption that "the crowd is wise", we adopt an eclectic approach and propose a clustering-based score-ensembling method for outlier detection. Using benchmark datasets, we quantitatively evaluate the robustness and accuracy of different ensemble strategies. We find that ensembling strategies offer only limited gains in overall accuracy, but they provide robustness and protection against underperforming models. We also discuss the use of randomization to create ensemble-based methods. Based on our results, we conclude that, given the current state of the art, unsupervised anomaly detection still faces significant challenges.
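The score-ensembling idea described in the abstract can be illustrated with a minimal sketch: run several base outlier detectors, normalize their scores onto a common scale, and combine them. This is an assumed illustration, not the thesis's actual method; the detector choices here (average k-nearest-neighbor distance and distance to the data centroid) and the rank-based normalization are stand-ins chosen for demonstration.

```python
import numpy as np

def knn_score(X, k=3):
    # Average distance to the k nearest neighbors (larger = more anomalous).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)[:, 1:k + 1]  # skip the zero self-distance
    return d_sorted.mean(axis=1)

def centroid_score(X):
    # Distance to the data centroid: a crude model-based outlier score.
    return np.linalg.norm(X - X.mean(axis=0), axis=1)

def rank_normalize(s):
    # Map raw scores to [0, 1] by rank so heterogeneous detectors
    # become comparable before combination.
    ranks = np.argsort(np.argsort(s))
    return ranks / (len(s) - 1)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 2))
X = np.vstack([X, [[8.0, 8.0]]])  # plant an obvious outlier at index 50

# Combine the two detectors by averaging their rank-normalized scores.
combined = (rank_normalize(knn_score(X)) + rank_normalize(centroid_score(X))) / 2
print(int(np.argmax(combined)))  # → 50, the planted outlier
```

Rank normalization sidesteps the problem that raw scores from different detectors live on incompatible scales, at the cost of discarding their magnitudes; the thesis's review of score-unification strategies addresses exactly this trade-off.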

1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Thesis Structure
2 Related Work
  2.1 Outlier Detection Methods
    2.1.1 Statistical Threshold Methods
    2.1.2 Neighborhood Methods
    2.1.3 Model-based Methods
    2.1.4 Outlier Detection via Clustering
  2.2 Ensemble Methodology
    2.2.1 Ensembles in Machine Learning
3 Using Ensemble Methodology for Anomaly Detection
  3.1 Outlier Detection Ensembles
    3.1.1 Background
  3.2 Outlier Ensemble Problem Definition
    3.2.1 Proposed Methods
  3.3 Randomized Internal Ensembles
    3.3.1 Random KNN
    3.3.2 Sliding-Window LOF
4 Empirical Results
  4.1 Experiment Setting
    4.1.1 Measuring Effectiveness of an Outlier Detector
    4.1.2 Datasets
    4.1.3 Software
  4.2 Experiments and Results
    4.2.1 Experiments
    4.2.2 Analysis of Results
  4.3 Discussion
5 Conclusions and Future Work
  5.1 Future Work

