簡易檢索 / 詳目顯示

研究生: 謝承翰
Cheng-Han Heish
論文名稱: 基於網格之資料密度分群演算法
A Clustering Algorithm Based on Grids and Their Data Densities
指導教授: 楊英魁
Ying-Kuei Yang
口試委員: 黎碧煌
Bih-Hwang Lee
李建南
Chien-Nan Lee
張博綸
Po-Lun Chang
學位類別: 碩士
Master
系所名稱: 電資學院 - 電機工程系
Department of Electrical Engineering
論文出版年: 2016
畢業學年度: 104
語文別: 中文
論文頁數: 82
中文關鍵詞: 資料探勘資料分群密度式分群法網格式分群法混合式分群法群集分析
外文關鍵詞: data mining, data clustering, clustering validation, density-based clustering, grid-based clustering, hybrid clustering
相關次數: 點閱:310下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報

隨著現今科技的快速發展,企業公司獲取的資訊越發豐沛,且獲取資訊的管道也越來越多,但是要如何妥善運用這些龐大的資源,使其成為有意義的參考資訊,便需要資料探勘技術的協助,而資料分群即為其中的熱門研究領域,近年來更有多位學者相繼提出新的演算法來改善現存演算法的效率。

常見的分群演算法大致分為階層式、分割式以及密度式,本文研究所提出新的資料分群方法命名為基於網格之資料密度分群法,該演算法以密度式分群演算法DBSCAN為基礎,結合網格式分群法快速定義區域密度的優點,保留DBSCAN演算法中透過擴張來形成群集的概念,目的為加強密度式分群法的分群效能。

一般密度式分群法在分群時,隨機將未處理過的資料點作為初始點,並藉由搜索鄰近區域的單位密度是否符合需求來擴張形成群集,而整個資料空間中,單位密度的大小則是由一開始給予的參數組合決定,且到演算結束這參數組合都不會改變,這種分群方式若遇上群集間距離極近且密度不同的資料集時,可能無法正確進行分群。而本文提出的演算方式,是以密度式分群法為基礎,搭配網格分割的概念,快速計算當前資料空間資料密度最大區域所在,並自此區域開始擴張形成群集,每有一個群集形成,便重新計算資料空間密度最大區域,而後再次擴張,直到終止條件達成。使不同群集間,透過多次迭代而擁有各自獨立的參數組合,能夠處理一些一般密度式分群法所無法處理的資料集,且為了減少人為參數的使用,本文研究還搭配群集分析的概念,令演算法透過分析不同參數搭配得到的分群結果後,自行判斷出分群結果最為良好的參數組合。


Data clustering is one of methods for data mining, and it is also the main topic at this thesis. Data clustering can be divided into several categorizes as Partitional Clustering, Hierarchical Clustering and Density-Based Clustering. Density-Based Clustering can deal with much more kinds of datasets to compare with other methods. However, it has the limit when two clusters have different densities and short distance because the parameters of algorithm are the same.

The main purpose of this thesis is to enhance the efficacy of Density-Based Clustering, and makes the algorithm enable to process two clusters which have different densities and short distance.

The proposed algorithm which is named as A Clustering Algorithm Based on Grids and Their Data Densities in this thesis which is combined Density-based Clustering with Grid-Based Clustering. The algorithm can quickly obtain the density information from datasets by grid. In order to improve the efficacy of clustering, the initial point is set in the region which is the highest density. At the most important point, the proposed algorithm enhances the efficacy of Density-Based Clustering by adjust the parameters according to the density of each cluster.

第一章 緒論 1-1 研究背景與動機 1-2 研究目的 1-3 研究流程 1-4 論文架構 第二章 相關文獻探討 2-1 資料探勘概述 2-1-1 關聯規則(Association Rule) 2-1-2 資料分類(Classification) 2-1-3 資料分群(Clustering) 2-1-4 序列型樣(Sequential Pattern) 2-2 階層式分群法(Hierarchical Clustering) 2-2-1 演算法流程 2-2-2 鄰近值的定義 2-2-3 階層式分群法的優缺點 2-3 分割式分群法(Partitional Clustering) 2-3-1 演算法流程 2-3-2 分割式演算法的優缺點 2-4 密度式分群法(Density-Based Clustering) 2-4-1 核心點、邊界點以及雜訊點 2-4-2 演算法流程 2-4-3 參數的選擇 2-4-4 密度式分群法的優缺點 2-5 以網格為基礎的分群方式(Grid-Based Clustering) 2-5-1 演算法流程 2-5-2 設定網格 2-5-3 網格式分群法的優缺點 第三章 基於網格之資料密度分群演算法 3-1 演算法概念 3-2 演算法中網格的設定 3-2-1 網格大小隨資料空間做變化 3-2-2 網格大小不隨資料空間做變化 3-2-3 網格大小的取得 3-2-4 資料點位在網格之上的情形 3-3 演算法流程 3-3-1 演算法參數設定 3-3-2 套用網格尋找最大區域密度 3-3-3 自密度最大的區域尋找初始點 3-3-4 求取Eps 及 MinPts 3-3-5 搜尋核心點以及邊界點 3-3-6 終止條件 3-3-7 判斷雜訊 3-4 演算法流程圖 3-5 自動判斷最少資料點權重值 3-5-1 CVNN 3-5-2 自動判斷最佳分群結果 3-5-3 演算法參數設定 3-5-4 判斷條件 3-5-5 使用CVNN進行自動判斷 第四章 實驗 4-1 實驗環境 4-2 實驗資料集 4-3 網格數量提升對於群集邊界的影響 4-4 實驗結果分析 4-4-1 實驗結果比較 4-5 與作為雛形的密度式分群法做效能評估 4-5-1 實驗一 4-5-2 實驗二 4-5-3 實驗三 4-6 分群和預期不同資料集之說明 4-7 自動判斷最少資料點權重值 4-7-1 以CVNN自動判斷權值實驗結果 4-7-2 自動判斷權值和人為給予權值的結果差異 4-8 更改判斷權值的演算法 4-8-2 更改CVNN前後實驗結果比較 4-9 針對實驗4-8更改權值演算法 4-9-1 緊密度改用群集資料點平均最遠距離 4-9-2 針對實驗4-8做更動後實驗結果 4-9-3 比較與探討 4-10 實驗結論 第五章 結論與未來展望 5-1 研究結論 5-2 未來研究方向 參考文獻 附錄A - K-means演算法實驗結果 附錄B - DBSCAN演算法實驗結果 附錄C - Hierarchical演算法實驗結果 附錄D - 基於網格之資料密度分群法實驗結果

[1] Pang-Ning Tan, Michael Steinbach and Vipn Kumar, 資料探勘, 台灣培生出版, 2008

[2] Ming-Syan Chen, Jiawei Han and Philip S. Yu, “Data mining: An Overview from a Database Perspective”, IEEE Transactions on Knowledge and Data Engineering, VOL. 8(6), 1996

[3] J. W. Han and M. Kanber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001

[4] Rakesf Agrawal, Tomasz Imieliński and Arun Swami, “Mining association rules between sets of items in large databases”, ACM SIGMOD international conference on Management of data, 207-216, Washington, D.C., 1993

[5] R. Agrawal and R. Srikant, “Mining Sequential Patterns”, International Conference on Data Engineering, 3-14, Taipei, Taiwan, 1995

[6] MacQueen J. B. “Some Methods for classification and Analysis of Multivariate Observations”, Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 281-297, Statistical Laboratory of the University of California, Berkeley, 1967

[7] Martin Ester, Hans-Peter Kriegel, Jiirg Sander and Xiaowei Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, International Conference on Knowledge Discovery and Data mining, 226-231, Oregon, Portland, 1996

[8] Dongming Chen, Yun Yan and Dongqi Wang, “Density Clustering Based on Border-Expanding”, International Conference on Natural Computation, 670-674, Xiamen, China, 2014

[9] Dominik Kellner, Jens Klappstein and Klaus Dietmayer, “Grid-Based DBSCAN for Clustering Extended Objects in Radar Data” , IEEE Intelligent Vehicles Symposium (IV), 365-370, Madrid, Spain, 2012

[10] Linmeng Zhang, Zhigao Xu and Fengqi Si, “GCMDDBSCAN: Multi-Density DBSCAN Based on Grid and Contribution”, IEEE International Conference on Dependable, Autonomic and Secure Computing, 502-507, Chengdu, Sichuan, China, 2013

[11] Chen Xiaoyun, Min Yufang, Zhao Yan and Wang Ping, “GMDBSCAN: Multi-Density DBSCAN Cluster Based on Grid”, IEEE International Conference on e-Business Engineering, 780-783, Xian, China, 2008

[12] 王繶梃, 「一個使用網格空間切割技術與改良式取樣概念之密度式分群演算法」, 屏東科技大學, 資訊管理學系, 碩士論文, 2011

[13] 陳玟君, 「在動態資料庫中以較少運算重建頻繁樣式樹之方法」, 台灣科技大學, 電機工程學系, 碩士論文, 2014

[14] 張智星, 資料分群與樣式辨認, URL: http://mirlab.org/jang/books/dcpr/

[15] Tian Zhang, Raghu Ramakrishnan and Miron Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Databases”, ACM SIGMOD international conference on Management of data, 103-114, Montreal, Quebec, Canada, 1996

[16] Sudipto Guha, Rajeev Rastogi and Kyuseok Shim, “CURE: An Efficient Clustering Algorithm for Large Databases”, ACM SIGMOD international conference on Management of data, 73-84, Seattle, Washington, USA, 1998

[17] George Karypis, Eui-Hong Han and Vipin Kumar, “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”, IEEE Computer, VOL. 32(8), 68-75, 1999

[18] Wei Wang, Jiong Yang and Richard Muntz, “STING : A Statistical Information Grid Approach to Spatial Data mining”, International Conference on Very Large Data Bases, 186-195, Athens, Greece, 1997

[19] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications”, ACM SIGMOD international conference on Management of data, 94-105, Seattle, Washington, USA, 1998

[20] Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, and Sen Wu,“Understanding and Enhancement of Internal Clustering Validation Measures”, IEEE Transactions on Cybernetics, VOL. 43(3), 2013

[21] G. Karypis, E.H. Han, V. Kumar, “CHAMELEON: A hierarchical 765 clustering algorithm using dynamic modeling”, IEEE Computers, VOL. 32 (8), 68-75, 1999

[22] Veenman, C.J., M.J.T. Reinders, and E. Backer, “A maximum variance cluster algorithm”. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1273-1280, 2002

[23] Chang, H. and D.Y. Yeung, “Robust path-based spectral clustering”, Pattern Recognition, 41(1), 191-203, 2008

[24] Zahn, C.T., “Graph-theoretical methods for detecting and describing gestalt clusters”, IEEE Transactions on Computers, 100(1), 68-86, 1971

[25] Gionis, A., H. Mannila, and P. Tsaparas, “Clustering aggregation”, ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 1-30, 2007

[26] Pasi Fränti and Olli Virmajoki, “Iterative shrinking method for clustering problems”, Pattern Recognition, 39 (5), 761-765, 2006

[27] Jeroen Kools , 6 functions for generating artificial datasets,
URL: http://www.mathworks.com/matlabcentral/fileexchange/41459-6-functions-for-generating-artificial-datasets

[28] Stan Salvador and Philip Chan, “Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms”, IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 576-584, Boca Raton, Florida, USA, 2004

[29] Xiao-Feng Wang and De-Shuang Huang, “A Novel Density-Based Clustering Framework by Using Level Set Method”, IEEE Transactions on Knowledge and Data Engineering, VOL. 21(11), 1515-1531, 2009

無法下載圖示 全文公開日期 2021/06/16 (校內網路)
全文公開日期 2026/06/16 (校外網路)
全文公開日期 2021/06/16 (國家圖書館:臺灣博碩士論文系統)
QR CODE