
Graduate student: Chia-jung Chang (張家榮)
Thesis title: Distance and Cosine Angle-Based Novel Clustering Techniques (以距離與餘弦夾角為基礎之創新群集方法研究)
Advisor: Min-Yuan Cheng (鄭明淵)
Oral examination committee: Sy-Jye Guo (郭斯傑), Cheng-Wei Su (蘇振維), Yo-Ming Hsieh (謝佑明)
Degree: Master
Department: College of Engineering, Department of Civil and Construction Engineering
Publication year: 2014
Academic year of graduation: 102
Language: Chinese
Pages: 101
Chinese keywords: 群集分析, Fuzzy C-means, K-means, 餘弦相似度
English keywords: Cluster Analysis, Fuzzy C-means, K-means, Cosine Similarity
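  • In data mining, cluster analysis is an important data preprocessing method, and K-means is one of its most widely used algorithms. K-means groups data by Euclidean distance. Euclidean distance captures the absolute difference between the numeric features of individual objects, but when distances are very close or even identical, it can be difficult to decide which cluster an object belongs to. The Fuzzy C-means algorithm, derived from K-means, applies fuzzy theory and uses membership degrees to express how strongly each object belongs to each cluster. Because those membership degrees still depend only on Euclidean distance, the same problem can arise. Cosine similarity is another commonly used measure. When it is used to measure the similarity between data, it distinguishes differences only by direction and is insensitive to absolute magnitude, so it has blind spots and shortcomings of its own.
    This study therefore examines these shortcomings and seeks to improve the K-means and Fuzzy C-means algorithms. It investigates the characteristics of the two measures, Euclidean distance and cosine similarity, uses both as the basis for clustering, and develops the θ-means and Fuzzy θ-means algorithms to address the problems above.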
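    The two blind spots described in this abstract can be illustrated with a minimal sketch. It is not code from the thesis and does not implement θ-means; the points and centroids below are made-up values chosen only to show a tie in Euclidean distance that direction can break, and a cosine similarity that ignores magnitude.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical data: p is equally far from both centroids,
# so Euclidean distance alone cannot decide its cluster.
p = (2.0, 1.0)
centroid_a = (4.0, 1.0)
centroid_b = (2.0, 3.0)
dist_a = euclidean_distance(p, centroid_a)
dist_b = euclidean_distance(p, centroid_b)
cos_a = cosine_similarity(p, centroid_a)
cos_b = cosine_similarity(p, centroid_b)
print("distances:", dist_a, dist_b)   # the two distances tie
print("cosines:  ", cos_a, cos_b)     # the directions differ

# Hypothetical data: q and r point in the same direction but differ
# greatly in magnitude, which cosine similarity cannot see.
q = (1.0, 1.0)
r = (10.0, 10.0)
cos_qr = cosine_similarity(q, r)
dist_qr = euclidean_distance(q, r)
print("cosine(q, r):", cos_qr, "distance(q, r):", dist_qr)
```

    In the first case the angle separates what the distance cannot, and in the second case the distance separates what the angle cannot, which is the motivation the abstract gives for using both measures together.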


    Clustering, also called unsupervised learning, is an important and fundamental task in data mining that aims to divide data into groups of similar objects. K-means is a popular clustering method that partitions n observations into k clusters, assigning each observation to the cluster with the nearest mean by Euclidean distance; that mean serves as the prototype of the cluster. When distances are very close or identical, however, the assignment is hard to decide. The Fuzzy C-means method was proposed to resolve this problem: it uses membership degrees from fuzzy theory, which allows a data point to belong to several clusters. However, Fuzzy C-means only partially solves the problem, because it still relies on Euclidean distance in the data-partitioning process. Cosine similarity has been widely used to measure the degree of similarity among data patterns by distinguishing them according to direction, so it does not depend on Euclidean distance and is insensitive to absolute magnitude. Combining K-means and Fuzzy C-means clustering with cosine similarity may offer an effective way to handle the problems related to Euclidean distance. The objective of this study is to discuss the shortcomings of the above-mentioned methods and to establish two new unsupervised learning models, the θ-means algorithm and the fuzzy θ-means algorithm, which are hybrids of cosine similarity with K-means and Fuzzy C-means clustering, respectively.
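    To make concrete why Fuzzy C-means only partially solves the problem, the following sketch computes the standard Fuzzy C-means membership degrees of a single point from its distances to each cluster center. It is an illustration under stated assumptions, not the thesis's Fuzzy θ-means: the function name `fcm_memberships`, the fuzzifier value m = 2, and the example distances are all chosen here for demonstration.

```python
def fcm_memberships(distances, m=2.0):
    """Standard FCM membership degrees of one point.

    distances: distance from the point to each cluster center.
    m: fuzzifier (> 1); m = 2 is an assumed, commonly used value.
    """
    # A point lying exactly on one or more centers belongs fully to them.
    zero = [d == 0 for d in distances]
    if any(zero):
        count = sum(zero)
        return [1.0 / count if z else 0.0 for z in zero]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

# Hypothetical distances: clearly closer to the first center.
clear_case = fcm_memberships([1.0, 3.0])
# Hypothetical distances: equally far from both centers.
tied_case = fcm_memberships([2.0, 2.0])
print("clear case:", clear_case)
print("tied case: ", tied_case)
```

    Because the memberships are a function of the distances alone, equal distances always yield equal memberships, so the ambiguity of the hard K-means assignment carries over to the fuzzy case.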

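    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Research Scope and Limitations
      1.4 Research Process and Methods
      1.5 Thesis Structure
    Chapter 2 Literature Review
      2.1 Data Mining
        2.1.1 Functions of Data Mining
      2.2 Cluster Analysis
        2.2.1 Partitional Clustering
        2.2.2 Hierarchical Clustering
      2.3 K-means Algorithm
        2.3.1 Steps of the K-means Algorithm
        2.3.2 Characteristics of the K-means Algorithm
      2.4 Fuzzy C-means Algorithm
        2.4.1 Steps of the Fuzzy C-means Algorithm
      2.5 Introduction to Measurement Methods
    Chapter 3 θ-means Algorithm
      3.1 Discussion of Problems with Euclidean Distance and Cosine Similarity
      3.2 Discussion of Problems with K-means
      3.3 θ-means Algorithm
      3.4 Characteristics of the θ-means Algorithm
    Chapter 4 Fuzzy θ-means Algorithm
      4.1 Discussion of Problems with Fuzzy C-means
      4.2 Fuzzy θ-means Algorithm
      4.3 Characteristics of the Fuzzy θ-means Algorithm
    Chapter 5 Case Testing and Analysis
      5.1 Construction Dispute Case
      5.2 Construction Project Success Level Case
      5.3 θ-means Case Testing and Analysis
        5.3.1 θ-means Algorithm: Construction Dispute Case
        5.3.2 θ-means Algorithm: Construction Project Success Level Case
      5.4 Fuzzy θ-means Case Testing and Analysis
        5.4.1 Fuzzy θ-means Algorithm: Construction Dispute Case
        5.4.2 Fuzzy θ-means Algorithm: Construction Project Success Level Case
    Chapter 6 Conclusions and Recommendations
      6.1 Conclusions
      6.2 Recommendations
    References
    Appendix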
