研究生: |
謝承翰 Cheng-Han Heish |
---|---|
論文名稱: |
基於網格之資料密度分群演算法 A Clustering Algorithm Based on Grids and Their Data Densities |
指導教授: |
楊英魁
Ying-Kuei Yang |
口試委員: |
黎碧煌
Bih-Hwang Lee 李建南 Chien-Nan Lee 張博綸 Po-Lun Chang |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 電機工程系 Department of Electrical Engineering |
論文出版年: | 2016 |
畢業學年度: | 104 |
語文別: | 中文 |
論文頁數: | 82 |
中文關鍵詞: | 資料探勘 、資料分群 、密度式分群法 、網格式分群法 、混合式分群法 、群集分析 |
外文關鍵詞: | data mining, data clustering, clustering validation, density-based clustering, grid-based clustering, hybrid clustering |
相關次數: | 點閱:310 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著現今科技的快速發展,企業公司獲取的資訊越發豐沛,且獲取資訊的管道也越來越多,但是要如何妥善運用這些龐大的資源,使其成為有意義的參考資訊,便需要資料探勘技術的協助,而資料分群即為其中的熱門研究領域,近年來更有多位學者相繼提出新的演算法來改善現存演算法的效率。
常見的分群演算法大致分為階層式、分割式以及密度式,本文研究所提出新的資料分群方法命名為基於網格之資料密度分群法,該演算法以密度式分群演算法DBSCAN為基礎,結合網格式分群法快速定義區域密度的優點,保留DBSCAN演算法中透過擴張來形成群集的概念,目的為加強密度式分群法的分群效能。
一般密度式分群法在分群時,隨機將未處理過的資料點作為初始點,並藉由搜索鄰近區域的單位密度是否符合需求來擴張形成群集,而整個資料空間中,單位密度的大小則是由一開始給予的參數組合決定,且到演算結束這參數組合都不會改變,這種分群方式若遇上群集間距離極近且密度不同的資料集時,可能無法正確進行分群。而本文提出的演算方式,是以密度式分群法為基礎,搭配網格分割的概念,快速計算當前資料空間資料密度最大區域所在,並自此區域開始擴張形成群集,每有一個群集形成,便重新計算資料空間密度最大區域,而後再次擴張,直到終止條件達成。使不同群集間,透過多次迭代而擁有各自獨立的參數組合,能夠處理一些一般密度式分群法所無法處理的資料集,且為了減少人為參數的使用,本文研究還搭配群集分析的概念,令演算法透過分析不同參數搭配得到的分群結果後,自行判斷出分群結果最為良好的參數組合。
Data clustering is one of methods for data mining, and it is also the main topic at this thesis. Data clustering can be divided into several categorizes as Partitional Clustering, Hierarchical Clustering and Density-Based Clustering. Density-Based Clustering can deal with much more kinds of datasets to compare with other methods. However, it has the limit when two clusters have different densities and short distance because the parameters of algorithm are the same.
The main purpose of this thesis is to enhance the efficacy of Density-Based Clustering, and makes the algorithm enable to process two clusters which have different densities and short distance.
The proposed algorithm which is named as A Clustering Algorithm Based on Grids and Their Data Densities in this thesis which is combined Density-based Clustering with Grid-Based Clustering. The algorithm can quickly obtain the density information from datasets by grid. In order to improve the efficacy of clustering, the initial point is set in the region which is the highest density. At the most important point, the proposed algorithm enhances the efficacy of Density-Based Clustering by adjust the parameters according to the density of each cluster.
[1] Pang-Ning Tan, Michael Steinbach and Vipn Kumar, 資料探勘, 台灣培生出版, 2008
[2] Ming-Syan Chen, Jiawei Han and Philip S. Yu, “Data mining: An Overview from a Database Perspective”, IEEE Transactions on Knowledge and Data Engineering, VOL. 8(6), 1996
[3] J. W. Han and M. Kanber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001
[4] Rakesf Agrawal, Tomasz Imieliński and Arun Swami, “Mining association rules between sets of items in large databases”, ACM SIGMOD international conference on Management of data, 207-216, Washington, D.C., 1993
[5] R. Agrawal and R. Srikant, “Mining Sequential Patterns”, International Conference on Data Engineering, 3-14, Taipei, Taiwan, 1995
[6] MacQueen J. B. “Some Methods for classification and Analysis of Multivariate Observations”, Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 281-297, Statistical Laboratory of the University of California, Berkeley, 1967
[7] Martin Ester, Hans-Peter Kriegel, Jiirg Sander and Xiaowei Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, International Conference on Knowledge Discovery and Data mining, 226-231, Oregon, Portland, 1996
[8] Dongming Chen, Yun Yan and Dongqi Wang, “Density Clustering Based on Border-Expanding”, International Conference on Natural Computation, 670-674, Xiamen, China, 2014
[9] Dominik Kellner, Jens Klappstein and Klaus Dietmayer, “Grid-Based DBSCAN for Clustering Extended Objects in Radar Data” , IEEE Intelligent Vehicles Symposium (IV), 365-370, Madrid, Spain, 2012
[10] Linmeng Zhang, Zhigao Xu and Fengqi Si, “GCMDDBSCAN: Multi-Density DBSCAN Based on Grid and Contribution”, IEEE International Conference on Dependable, Autonomic and Secure Computing, 502-507, Chengdu, Sichuan, China, 2013
[11] Chen Xiaoyun, Min Yufang, Zhao Yan and Wang Ping, “GMDBSCAN: Multi-Density DBSCAN Cluster Based on Grid”, IEEE International Conference on e-Business Engineering, 780-783, Xian, China, 2008
[12] 王繶梃, 「一個使用網格空間切割技術與改良式取樣概念之密度式分群演算法」, 屏東科技大學, 資訊管理學系, 碩士論文, 2011
[13] 陳玟君, 「在動態資料庫中以較少運算重建頻繁樣式樹之方法」, 台灣科技大學, 電機工程學系, 碩士論文, 2014
[14] 張智星, 資料分群與樣式辨認, URL: http://mirlab.org/jang/books/dcpr/
[15] Tian Zhang, Raghu Ramakrishnan and Miron Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Databases”, ACM SIGMOD international conference on Management of data, 103-114, Montreal, Quebec, Canada, 1996
[16] Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, “CURE: An Efficient Clustering Algorithm for Large Databases”, ACM SIGMOD international conference on Management of data, 73-84, Seattle, Washington, USA, 1998
[17] George Karypis, Eui-Hong Han and Vipin Kumar, “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”, IEEE Computer, VOL. 32(8), 68-75, 1999
[18] Wei Wang, Jiong Yang and Richard Muntz, “STING : A Statistical Information Grid Approach to Spatial Data mining”, International Conference on Very Large Data Bases, 186-195, Athens, Greece, 1997
[19] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications”, ACM SIGMOD international conference on Management of data, 94-105, Seattle, Washington, USA, 1998
[20] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and Sen Wu,“Understanding and Enhancement of Internal Clustering Validation Measures”, IEEE Transactions on Cybernetics, VOL. 43(3), 2013
[21] G. Karypis, E.H. Han, V. Kumar, “CHAMELEON: A hierarchical 765 clustering algorithm using dynamic modeling”, IEEE Computers, VOL. 32 (8), 68-75, 1999
[22] Veenman, C.J., M.J.T. Reinders, and E. Backer, “A maximum variance cluster algorithm”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1273-1280, 2002
[23] Chang, H. and D.Y. Yeung, “Robust path-based spectral clustering”, Pattern Recognition, 41(1), 191-203, 2008
[24] Zahn, C.T., “Graph-theoretical methods for detecting and describing gestalt clusters”, IEEE Transactions on Computers, 100(1), 68-86, 1971
[25] Gionis, A., H. Mannila, and P. Tsaparas, “Clustering aggregation”, ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 1-30, 2007
[26] Pasi Fränti and Olli Virmajoki, “Iterative shrinking method for clustering problems”, Pattern Recognition, 39 (5), 761-765, 2006
[27] Jeroen Kools , 6 functions for generating artificial datasets,
URL: http://www.mathworks.com/matlabcentral/fileexchange/41459-6-functions-for-generating-artificial-datasets
[28] Stan Salvador and Philip Chan, “Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms”, IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 576-584, Boca Raton, Florida, USA, 2004
[29] Xiao-Feng Wang and De-Shuang Huang, “A Novel Density-Based Clustering Framework by Using Level Set Method”, IEEE Transactions on Knowledge and Data Engineering, VOL. 21(11), 1515-1531, 2009