Basic Search / Detailed Display

Author: 莊峻豪
Chun-hao Chuang
Thesis Title: 以資料密度為基礎的多階段分群演算法
A Density-based Multistage Clustering Algorithm
Advisor: 徐俊傑
Chiun-chieh Hsu
Committee: 賴源正
Yuan-cheng Lai
羅乃維
Nai-wei Lo
Degree: 碩士
Master
Department: 管理學院 - 資訊管理系
Department of Information Management
Thesis Publication Year: 2010
Graduation Academic Year: 98
Language: 中文
Pages: 54
Keywords (in Chinese): 密度基礎減法分群模糊分群階層式分群
Keywords (in other languages): Density-based, Subtractive clustering, Fuzzy clustering, Hierarchical clustering
Reference times: Clicks: 254Downloads: 4
Share:
School Collection Retrieve National Library Collection Retrieve Error Report

  近年來隨著電子商務的興起,許多企業紛紛投入電子化,企業們所蒐集到的資料越來越多,對於管理者來說,若能從這些資料中發掘出有用的資訊,將會有莫大的助益。因此,資料探勘已經成為相當重要而且熱門的研究領域之一。

  分群演算法在不需事先定義類別下,依照資料本身特性集結成群,可幫助我們區隔且辨識出不同的資料關係。所以,分群演算法是其中重要的一環,其目標是資料分群後,群內資料同質性很高,而群與群之間異質性也很高。以目標函數為基礎的模糊分群演算法,其需事先指定欲分群的數量,但即便是指定群數後,因初始群心位置採用隨機方式決定,導致每次分群結果不一定相同,結果也可能落入區域最佳解,若資料中包含了雜訊,對於結果的影響更為顯著,不可不慎選。雖此一方法可搭配減法演算法來減少問題的發生,但對於非凸形資料來說,較無法描繪出資料形狀,分群結果也不慎理想。若改以模糊分群演算法搭配階層式演算法,雖可描繪出任意形狀群集,但依然容易受到雜訊影響,導致結果不佳。

  因此,本論文提出一種多階段分群演算法,結合減法演算法、模糊演算法以及階層式演算法之優點,並修改減法演算法以多群技巧及過濾雜訊來解決上述問題。本演算法包含三個階段。第一階段為初始群心挑選階段,又可細分成三階段:挑選密度明顯較高的資料點為群心、挑選密度中低但距離其它群心較遠的資料點為群心、過濾雜訊點並依使用者指定而產生較多的群心;第二階段為群集修正階段使用模糊分群演算法,根據上一階段挑選出的群心作為此階段的輸入,演算法進行疊代運算直到收斂;第三階段為群集合併階段,以單一連結方式將多個群集合併,直到剩下合適的群數為止。根據實驗結果顯示,本論文提出之方法能夠提升準確率,並改善先前方法之缺點。


  With the increase of the e-commerce in recent years, large amounts of enterprises start to invest in computerization and thus generate a tremendous amount of data. For the managers, it will be a great benefit for the enterprise if useful information can be extracted from these raw data. Therefore, data mining has become one of important and popular research domains.

  Clustering algorithms can recognize and partition data according to their attributes’ characteristics without defining any categorization information in advance. Therefore, clustering algorithms play an important role in data mining, where the goal is to maximize the homogeneity of objects within the clusters while also maximize the heterogeneity between clusters. Fuzzy C-Means is one of the most popular clustering algorithms, where the number of clusters should be given in advance. Even though we assign the number of clusters, the result of clustering may fall into the local optimum. In addition, because the initial cluster centers are determined by random, the result of each execution may be different. Furthermore, if the data contains noise, it will induce more significant impact on the results, so it is important to suitably select the initial cluster centers. Although the influence of randomly selecting centers can be reduced if the subtractive method is adopted, it is still difficult to deal with the non-spherical shape clusters. If we use the FCM algorithm with the hierarchical algorithm, it can deal with the non-spherical shape clusters, but it cannot easily handle noise.

  Hence, this paper proposes a density-based multistage clustering algorithm, combining with subtractive clustering, fuzzy clustering and hierarchical clustering methods for solving the problems mentioned above. There are three stages in the approach. The first stage is to suitably select the initial cluster centers; the second stage is to modify the distribution of data points; and the third stage is to merge clusters until appropriate number of clusters is achieved. The experimental results show that our proposed method can improve the performance of clustering.

中文摘要 I 英文摘要 II 目錄 IV 圖索引 VI 表索引 VIII 第一章 緒論 1 1.1 研究背景 1 1.2 研究動機 1 1.3 研究目的 3 1.4 研究範圍與限制 3 1.5 論文架構 3 第二章 文獻探討 4 2.1 分群技術 4 2.2 分割式分群演算法 7 2.3 階層式分群演算法 8 2.4 以密度為基礎的分群演算法 10 2.5 以網格為基礎的分群演算法 12 2.6 模糊分群演算法 13 2.7 混合式分群演算法 15 2.8 群集有效性分析 16 第三章 研究方法 19 3.1 問題描述 19 3.2 DMC演算法 20 3.2.1初始群心挑選階段 23 3.2.2 群集修正階段 25 3.2.3 群集合併階段 26 3.3 完整範例 27 3.4 參數設定說明 32 第四章 實驗分析與結果 33 4.1 實驗資料與環境 33 4.2 FCM、SFCM與DMC之比較 35 4.3 HFCM與DMC之比較 44 4.4 實驗總結 49 第五章 結論與未來研究方向 51 參考文獻 52

[1] Bezdek , J. C., "Cluster Validity with Fuzzy Sets", Journal of Cybernetics, vol. 3, pp. 58-73 (1973).
[2] Bezdek, J. C., "Numerical taxonomy with fuzzy sets", Journal of Mathematical Biology, vol. 1, pp. 57-71 (1974).
[3] Bezdek, J. C., Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York (1981).
[4] Bezdek, J. C., Ehrlich, R., and Full, W., "FCM: The fuzzy c-means clustering algorithm", Computers & Geosciences, vol. 10, pp. 191-203 (1984).
[5] Chiu, S. L., "Fuzzy model identification based on cluster estimation", Journal of Intelligent Fuzzy Systems, vol. 2, pp. 267-278 (1994).
[6] Dave, R. N., "Validating fuzzy partitions obtained through c-shells clustering", Pattern Recognition Letters, vol. 17, pp. 613-623 (1996).
[7] Dragut, A. B. and Nichitiu, C. M., "A Monotonic On-Line Linear Algorithm for Hierarchical Agglomerative Classification", Information Technology and Management, vol. 5, pp. 111-141 (2004).
[8] Dubes, R. C., "How many clusters are best? - An experiment", Pattern Recognition, vol. 20, pp. 645-663 (1987).
[9] Dunn, J. C., "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, vol. 3, pp. 32 - 57 (1973).
[10] Fukuyama, Y. and Sugeno, M., "A new method of choosing the number of clusters for the fuzzy c-means method", in Proceedings of Fifth Fuzzy Systems Symposium, pp. 247-250 (1989).
[11] Gan, G., Ma, C., and Wu, J., Data Clustering: Theory, Algorithms, and Applications, SIAM (2007).
[12] Gao, X. and Xie, W., "Advances in theory and applications of fuzzy clustering", Chinese Science Bulletin, vol. 45, pp. 961-970 (2000).
[13] GuldemIr, H. and Sengur, A., "Comparison of clustering algorithms for analog modulation classification", Expert Systems with Applications, vol. 30, pp. 642-649 (2006).
[14] Hammouda, K. and Karay, F., "A Comparative Study of Data Clustering Techniques", Course Project (2000).
[15] Han, J. and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco (2001).
[16] Kuik, S.P. and Salim, N., "Optimized Subtractive Clustering for Cluster-Based Compound Selection", in Proceedings of the 1st International Conference on Natural Resources Engineering & Technology, Putrajaya, Malaysia, pp. 492-499 (2006).
[17] Li, Y. and Chung, S., "Parallel bisecting k-means with prediction clustering algorithm", The Journal of Supercomputing, vol. 39, pp. 19-37 (2007).
[18] Liu, W. Y., Xiao, C. J., Wang, B. W., Shi, Y., and Fang, S. F., "Study on combining subtractive clustering with fuzzy c-means clustering", in Proceedings of the Second International Conference on Machine Learning and Cybernetics, vol.5, pp. 2659-2662 (2003).
[19] Pakhira, M. K., Bandyopadhyay, S., and Maulik, U., "Validity index for crisp and fuzzy clusters", Pattern Recognition, vol. 37, pp. 487-501 (2004).
[20] Pakhira, M. K., Bandyopadhyay, S., and Maulik, U., "A study of some fuzzy cluster validity indices, genetic clustering and application to pixel classification", Fuzzy Sets and Systems, vol. 155, pp. 191-214 (2005).
[21] Pal, N. R. and Bezdek, J. C., "On cluster validity for the fuzzy c-means model", Fuzzy Systems, IEEE Transactions on, vol. 3, pp. 370-379 (1995).
[22] Wong, C. C., Chen, C. C., and Su, M. C., "A novel algorithm for data clustering", Pattern Recognition, vol. 34, pp. 425-442 (2001).
[23] Xie, X. L. and Beni, G., "A validity measure for fuzzy clustering", Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 13, pp. 841-847 (1991).
[24] Yager, R. R. and Filev, D. P., "Approximate clustering via the mountain method", Systems, Man and Cybernetics, IEEE Transactions on, vol. 24, pp. 1279-1284 (1994).
[25] Yu, J., Cheng, Q., and Huang, H., "Analysis of the weighting exponent in the FCM", IEEE Transactions on Systems, Man and Cybernetics -Part B: Cybernetics, vol. 34, no. 1, pp. 634-639 (2004).
[26] Zadeh, L. A., "Fuzzy sets", Information and Control, vol. 8, pp. 338-353 (1965).
[27] 陳同孝、陳雨霖、劉明山、許文綬、林志強、邱永興,「結合K-means及階層式分群法之二階段分群演算法」,電腦學刊,第十七卷,第一期,第65-76頁 (2006)。

QR CODE