Author: | 張家榮 Chia-jung Chang |
---|---|
Thesis title: | 以距離與餘弦夾角為基礎之創新群集方法研究 Distance and Cosine Angle-Based Novel Clustering Techniques |
Advisor: | 鄭明淵 Min-Yuan Cheng |
Oral defense committee: | 郭斯傑 Sy-Jye Guo, 蘇振維 Cheng-Wei Su, 謝佑明 Yo-Ming Hsieh |
Degree: | Master |
Department: | College of Engineering - Department of Civil and Construction Engineering |
Year of publication: | 2014 |
Academic year of graduation: | 102 |
Language: | Chinese |
Pages: | 101 |
Keywords (Chinese): | 群集分析, Fuzzy C-means, K-means, 餘弦相似度 |
Keywords (English): | Cluster Analysis, Fuzzy C-means, K-means, Cosine Similarity |
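In data mining, cluster analysis is an important data-preprocessing method, and K-means is one of its most commonly used algorithms. K-means partitions data on the basis of Euclidean distance. Euclidean distance captures the absolute difference between the numerical features of individual objects, but when distances are very close or even identical, the cluster assignment can become difficult to decide. The Fuzzy C-means algorithm, derived from K-means, draws on fuzzy theory and uses membership degrees to express how strongly each object belongs to each cluster. The membership degrees are still determined by Euclidean distance alone, however, so the same problem can arise. Cosine similarity is another commonly adopted measure. Because it distinguishes data only by direction and is insensitive to absolute magnitude, it has blind spots and shortcomings of its own.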
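The two blind spots described above can be illustrated with a small sketch. The points and centroids below are made-up values chosen for illustration, not data from the thesis, and the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Blind spot of Euclidean distance: the point is exactly 2.0 away from
# both centroids, so distance alone cannot decide the assignment.
point = (3.0, 1.0)
centroid_1 = (1.0, 1.0)
centroid_2 = (5.0, 1.0)
print(euclidean_distance(point, centroid_1), euclidean_distance(point, centroid_2))

# Cosine similarity still separates them (about 0.894 vs about 0.992).
print(cosine_similarity(point, centroid_1), cosine_similarity(point, centroid_2))

# Blind spot of cosine similarity: a and b point in the same direction,
# so their similarity is approximately 1.0 even though they are far apart.
a = (1.0, 1.0)
b = (10.0, 10.0)
print(cosine_similarity(a, b), euclidean_distance(a, b))
```

In this toy case each measure resolves the ambiguity left by the other, which is the motivation for using both together.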
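This study therefore examines these shortcomings with the aim of improving the K-means and Fuzzy C-means algorithms. It explores the characteristics of the two measures, Euclidean distance and cosine similarity, and uses both together as the basis for clustering, developing the θ-means and Fuzzy θ-means algorithms to address the problems described above.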
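The abstract does not state how θ-means combines the two measures, so the following is only one hypothetical way to use both in an assignment step, not the thesis's algorithm. It assumes Euclidean distance is the primary criterion and cosine similarity breaks ties among centroids whose distances fall within an assumed tolerance `tol` of the minimum. The function `assign_cluster` and the parameter `tol` are illustrative names.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Cosine similarity is undefined for zero vectors; treat it as 0.0 here.
    return dot / norm if norm > 0.0 else 0.0

def assign_cluster(point, centroids, tol=1e-9):
    """Hypothetical hybrid assignment (an assumption, not the thesis's rule).

    Picks the nearest centroid by Euclidean distance; among centroids whose
    distance is within `tol` of the minimum, picks the one with the highest
    cosine similarity to the point.
    """
    distances = [euclidean_distance(point, c) for c in centroids]
    d_min = min(distances)
    candidates = [i for i, d in enumerate(distances) if d - d_min <= tol]
    return max(candidates, key=lambda i: cosine_similarity(point, centroids[i]))

# Equidistant centroids: the tie is resolved by direction.
print(assign_cluster((3.0, 1.0), [(1.0, 1.0), (5.0, 1.0)]))

# Clear distance difference: the nearest centroid wins regardless of angle.
print(assign_cluster((1.0, 1.0), [(1.0, 2.0), (9.0, 9.0)]))
```

A weighted sum of the two measures would be another plausible design. The tie-breaking form is used here only because it maps directly onto the near-equal-distance case the abstract describes.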
Clustering, a form of unsupervised learning, is an important and fundamental task in data mining that aims to divide data into groups of similar objects. K-means is a popular clustering method that partitions n observations into k clusters, assigning each observation to the cluster with the nearest mean, which serves as the prototype of the cluster, based on the Euclidean distance. When an observation is nearly or exactly equidistant from several means, however, its assignment becomes ambiguous. The Fuzzy C-means method was proposed to resolve this problem: it applies fuzzy theory and uses membership degrees, thus allowing data to belong to several groups. However, the Fuzzy C-means method only partially resolves the problem because it still relies on the Euclidean distance in the data-partitioning process. Cosine similarity has been broadly used to measure the degree of similarity among data patterns by distinguishing data based on direction, so it is not affected by the Euclidean distance. Combining K-means and Fuzzy C-means clustering with cosine similarity may therefore offer an efficient way to handle the Euclidean distance-related problems. The objective of this study is to discuss the shortcomings of the above-mentioned methods and to establish two new unsupervised learning models, the θ-means algorithm and the fuzzy θ-means algorithm, which are hybrids of cosine similarity with K-means and Fuzzy C-means clustering, respectively.
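To make the dependence on Euclidean distance concrete, the sketch below computes the standard Fuzzy C-means membership degrees of one point, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to centroid i and m is the fuzzifier. The sample points and the default m = 2 are assumptions for illustration, not values from the thesis.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fcm_memberships(point, centroids, m=2.0):
    """Standard Fuzzy C-means membership degrees of one point (m > 1)."""
    distances = [euclidean_distance(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs fully to them.
    on_centroid = [d == 0.0 for d in distances]
    if any(on_centroid):
        share = 1.0 / sum(on_centroid)
        return [share if hit else 0.0 for hit in on_centroid]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

centroids = [(1.0, 1.0), (5.0, 1.0)]

# Closer to the first centroid: memberships are about 0.9 and 0.1.
print(fcm_memberships((2.0, 1.0), centroids))

# Equidistant point: memberships are equal, so the ambiguity remains.
print(fcm_memberships((3.0, 1.0), centroids))
```

The second call shows why fuzzy memberships alone only partially help: equal distances yield equal memberships, so a point that K-means cannot place is equally undecided under Fuzzy C-means.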