
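Graduate student: 梁珪信 (Kuei-Hsin Liang)
Thesis title: 針對以密度為基礎之分群方法辨別雜訊後之精煉 (Refinement After Density-based Clustering on Dirty Data)
Advisor: 戴碧如 (Bi-Ru Dai)
Oral examination committee: 吳怡樂 (Yi-Leh Wu), 戴志華 (Chih-Hua Tai)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2018
Graduation academic year: 106
Language: English
Number of pages: 29
Keywords (translated from Chinese): density-based clustering, clustering, noise, noise reuse
Keywords (English): DBSCAN, noise reuse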

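Clustering algorithms are a widely used knowledge-discovery tool for finding data of the same class, that is, data with high similarity. Noise is what the algorithm reports when it decides that a data point belongs to no cluster. Sometimes a point is noise because it really carries no useful information. In other cases, however, it is labeled noise because of the data-collection environment, occlusions, aging sensors, or incorrectly set algorithm parameters. Such meaningful noise loses the information behind it: it is misjudged and then discarded by most noise-removal methods, which lowers clustering accuracy.
In this thesis, we correct the noise produced by density-based clustering. The two original parameters of the DBSCAN algorithm define the constraints of the correction method, and the meaningful noise points are reassigned to the correct cluster or form a cluster of their own. We call this algorithm for refining DBSCAN noise RaC-DBSCAN. In the final experimental chapter we use synthetic datasets, UCI datasets, and a real-world dataset, and compare against the original method. Our algorithm effectively improves clustering accuracy and makes full use of data that was misclassified as noise.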


Clustering algorithms are efficient for the task of class identification in spatial databases. Noise left after clustering is sometimes meaningful: it can result from inappropriate parameter settings or from environmental factors during data collection. We call such points "dirty data". Methods that simply remove noise lose considerable information, because they ignore dirty data that is in part meaningful.
In this thesis, we present a method to refine the result of density-based clustering, using the two parameters of the clustering algorithm to complete our proposed definitions. We evaluate the effectiveness of the refinement in combination with DBSCAN, the best-known density-based clustering algorithm across various applications. We call the combination RaC-DBSCAN and test it on synthetic datasets, UCI datasets, and a real dataset. The experimental results demonstrate that RaC-DBSCAN not only improves the precision with which each cluster is identified but also creates further potential by making use of dirty data.
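To make the idea in the abstract concrete, the following is a minimal sketch of standard DBSCAN followed by a purely illustrative noise-reassignment step. It is not the RaC-DBSCAN algorithm, whose definitions are given only in the thesis body. The function `reassign_noise`, its `slack` parameter, and the toy points are assumptions introduced here; the sketch also omits the case in which leftover noise points form a cluster of their own.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Standard DBSCAN. Returns one label per point; NOISE (-1) marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

def reassign_noise(points, labels, eps, slack=2.0):
    """Hypothetical refinement (an assumption, not the thesis's rule):
    attach each noise point to the cluster of its nearest clustered point
    when that point lies within slack * eps; otherwise leave it as noise."""
    refined = list(labels)
    clustered = [i for i, lab in enumerate(labels) if lab != NOISE]
    for i, lab in enumerate(labels):
        if lab != NOISE or not clustered:
            continue
        nearest = min(clustered, key=lambda j: dist(points[i], points[j]))
        if dist(points[i], points[nearest]) <= slack * eps:
            refined[i] = labels[nearest]
    return refined

# Toy data: two tight squares, one point just outside the first square,
# and one isolated point far from everything.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (3, 1), (50, 50)]

labels = dbscan(points, eps=1.5, min_pts=3)
refined = reassign_noise(points, labels, eps=1.5)
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]
print(refined)  # [0, 0, 0, 0, 1, 1, 1, 1, 0, -1]
```

In this toy run DBSCAN marks the point (3, 1) as noise because it falls just outside the eps-neighborhood of the first square, and the illustrative step folds it back into that cluster while the distant point (50, 50) stays noise. The pure-Python neighbor search is quadratic and is meant only for small examples.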

Cover
Advisor's Recommendation Letter
Oral Examination Committee Approval
Abstract
Abstract (Chinese)
Acknowledgements
Table of Contents
List of Figures
1 Introduction
1.1 Background
1.2 Motivation
2 Related Work
2.1 Clustering Algorithm Comparing
2.2 Density-based Clustering-DBSCAN
3 Proposed Method
3.1 DBSCAN Problem
3.2 Refinement after DBSCAN
4 Experimental Result
4.1 Artificial Dataset
4.2 UCI Dataset
4.3 Real Dataset
5 Conclusion
Reference

