
Graduate Student: Maisyatus Suadaa Irfana
Thesis Title: Data Clustering with Complete Must-Link Constraints
Advisor: Chao-Lung Yang (楊朝龍)
Committee Members: Kung-Jeng Wang (王孔政), Ren-Jieh Kuo (郭人介)
Degree: Master
Department: Department of Industrial Management, School of Management
Year of Publication: 2014
Graduation Academic Year: 102
Language: English
Number of Pages: 67
Keywords: Constrained Clustering, Complete Must-Link, Principal Component Analysis, Data Overlapping
Abstract:
    This research aims to develop an integrated method to solve a special but not uncommon constrained clustering problem constructed by Complete Must-Link (CML) constraints. Constrained clustering is a semi-supervised learning approach that incorporates prior knowledge about group membership, when it is available, to improve the efficiency and purity of clustering. The CML clustering problem can be viewed as aggregating pre-defined data groups: through the transitive closure of the constraints, the data in each group are replaced by their centroid before clustering. This replacement causes an information-loss problem, because the distribution and shape of each original group are discarded, and the problem becomes especially severe when the groups overlap one another. To overcome this problem, a new method named PCA-CML is proposed for the CML constrained clustering problem. An overlapped ratio is introduced to determine whether the pre-defined CML partitions overlap. When they do, principal component analysis (PCA) scores, which provide supplemental information describing the original partition blocks, are added to the distance matrix used by the constrained clustering algorithm. The proposed algorithm is tested on simulated datasets, containing both overlapped and non-overlapped groups, and on a real-world dataset of cartridge quality information. The experimental results show that the proposed algorithm alleviates the information-loss problem in CML constrained clustering when the pre-defined CML partitions overlap.
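    To make the workflow described in the abstract concrete, the sketch below shows, under simplifying assumptions, how CML groups might be collapsed to centroids, how an overlap measure could decide whether the groups intermingle, and how PCA-derived scores describing each group's spread could be appended before clustering. The functions overlap_ratio and pca_cml_features, the 0.1 threshold, and the use of explained variance as the PCA score are illustrative assumptions, not the overlapped ratio or distance-matrix construction defined in the thesis.

# A minimal sketch (not the thesis implementation) of the PCA-CML idea:
# collapse each Complete Must-Link (CML) group to its centroid, check whether
# the pre-defined groups overlap, and, if so, append PCA scores of each group
# as supplemental features before clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def collapse_to_centroids(X, group_labels):
    """Replace every CML group by its centroid (the classic aggregation step)."""
    groups = np.unique(group_labels)
    return np.vstack([X[group_labels == g].mean(axis=0) for g in groups]), groups


def overlap_ratio(X, group_labels):
    """Rough overlap measure (an assumption): fraction of points whose nearest
    group centroid is not the centroid of their own group."""
    centroids, groups = collapse_to_centroids(X, group_labels)
    nearest = np.argmin(
        np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1
    )
    return np.mean(groups[nearest] != group_labels)


def pca_cml_features(X, group_labels, n_components=1):
    """Augment each group centroid with PCA-based spread scores of its own group,
    so the centroid also carries the group's shape information."""
    centroids, groups = collapse_to_centroids(X, group_labels)
    shape_info = []
    for g in groups:
        block = X[group_labels == g]
        k = min(n_components, block.shape[0], block.shape[1])
        pca = PCA(n_components=k).fit(block)
        # Standard deviation along the leading components as a compact
        # description of the block's spread (an illustrative choice).
        scores = np.sqrt(pca.explained_variance_)
        scores = np.pad(scores, (0, n_components - k))
        shape_info.append(scores)
    return np.hstack([centroids, np.vstack(shape_info)])


# Usage: four pre-defined CML groups clustered into two clusters,
# with or without the PCA augmentation depending on the overlap measure.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=s, size=(30, 2))
               for m, s in [(0.0, 1.0), (0.5, 0.2), (4.0, 1.0), (4.5, 0.2)]])
cml = np.repeat([0, 1, 2, 3], 30)          # pre-defined CML group labels

if overlap_ratio(X, cml) > 0.1:            # threshold is an assumption
    features = pca_cml_features(X, cml)    # centroids + PCA spread scores
else:
    features, _ = collapse_to_centroids(X, cml)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)                               # one cluster label per CML group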

    List of Tables:
    Table 1. Comparison of clustering algorithms
    Table 2. Covariance Matrix, 3-Dimensional
    Table 3. Notation of Overlapped Ratio
    Table 4. Experiment Result, Non-overlapped dataset
    Table 5. Experiment Result, Overlapped dataset
    Table 6. Experiment Result, Cartridge dataset
    Table 7. Experimental Result of Yeast dataset
    Table 8. CML Information
    Table 9. Experimental Result of Auto MPG dataset

    [1]"Clustering," in Encyclopedia of Machine Learning, C. Sammut and G. Webb, Eds., ed: Springer US, 2010, pp. 180-180.
    [2]Z. Ghahramani, "Unsupervised Learning," in Advanced Lectures on Machine Learning. vol. 3176, O. Bousquet, U. Luxburg, and G. Ratsch, Eds., ed: Springer Berlin Heidelberg, 2004, pp. 72-112.
    [3]A. Jain, "Data Clustering: 50 Years Beyond K-means," in Machine Learning and Knowledge Discovery in Databases. vol. 5211, W. Daelemans, B. Goethals, and K. Morik, Eds., ed: Springer Berlin Heidelberg, 2008, pp. 3-4.
    [4]I. D. S. Basu, K. Wagstaff, Clustering with Constraints: Algorithms, Applications: Chapman & Hall/CRC Press, 2008.
    [5]H. Jiang, Z. Ren, J. Xuan, and X. Wu, "Extracting elite pairwise constraints for clustering," Neurocomputing, vol. 99, pp. 124-133, 1/1/ 2013.
    [6]Z. Shaohong and W. Hau-San, "Active constrained clustering with multiple cluster representatives," in Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, 2009, pp. 2689-2694.
    [7]K. Wagstaff, C. Cardie, S. Rogers, S. Schr, #246, and dl, "Constrained K-means Clustering with Background Knowledge," presented at the Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
    [8]K. Wagstaff and C. Cardie, "Clustering with Instance-level Constraints," presented at the Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
    [9]X. Xiao-Hua, P. Zhou-Jin, H. Ping, and C. Ling, "Constrained ant clustering," in Machine Learning and Cybernetics (ICMLC), 2011 International Conference on, 2011, pp. 1566-1570.
    [10]M. N. M. A.K. JAIN, P.J. FLYNN, "Data Clustering: A Review," ACM Computing Surveys, 31(3). 2000.
    [11]M. C. Law, A. Topchy, and A. Jain, "Clustering with Soft and Group Constraints," in Structural, Syntactic, and Statistical Pattern Recognition. vol. 3138, A. Fred, T. Caelli, R. W. Duin, A. Campilho, and D. de Ridder, Eds., ed: Springer Berlin Heidelberg, 2004, pp. 662-670.
    [12]I. Davidson and S. S. Ravi, "Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results," in Knowledge Discovery in Databases: PKDD 2005. vol. 3721, A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, Eds., ed: Springer Berlin Heidelberg, 2005, pp. 59-70.
    [13]A. Y. N. E. P. Xing, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," presented at the in Advances in Neural Information Processing Systems 15, 2002.
    [14]Z. Li and L. Tao, "Semi-supervised Hierarchical Clustering," in Data Mining (ICDM), 2011 IEEE 11th International Conference on, 2011, pp. 982-991.
    [15]J. B. MacQueen, "Some Methods for Classification and Analysis of MultiVariate Observations," presented at the Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
    [16]J. Wu, "Cluster Analysis and K-means Clustering: An Introduction," in Advances in K-means Clustering, ed: Springer Berlin Heidelberg, 2012, pp. 1-16.
    [17]D. Fisher, "Knowledge Acquisition Via Incremental Conceptual Clustering," Machine Learning, vol. 2, pp. 139-172, 1987/09/01 1987.
    [18]H. Yi and K. Sam, "Learning Assignment Order of Instances for the Constrained K-Means Clustering Algorithm," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 39, pp. 568-574, 2009.
    [19]C.-L. Yang, "Adaptive Clustering Model for Improving Color Consistency," Doctor of Philosophy, Purdue University, 2009.
    [20]S. S. R. Ian Davidson, "Clustering With Constraints: Feasibility Issues and the k-Means Algorithm," Fifth SIAM International Conference on Data Mining, H. Kargupta, J. Srivastava, C. Kamath, and A. Goodman, SIAM, Newport, CA, 138-149., 2005.
    [21]Y. Zhao, G. Karypis, and U. Fayyad, "Hierarchical Clustering Algorithms for Document Datasets," Data Mining and Knowledge Discovery, vol. 10, pp. 141-168, 2005/03/01 2005.
    [22]R. A. Johnson and D. W. Wichern, Applied multivariate statistical analysis: Prentice-Hall, Inc., 2001.
    [23]T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, pp. 83-85, 2005/06/01 2005.
    [24]C. Combes and J. Azema, "Clustering using principal component analysis applied to autonomy–disability of elderly people," Decision Support Systems, vol. 55, pp. 578-586, 5// 2013.
    [25]H. Gao, W. Hong, J. Cui, and Y. Xu, "Optimization of Principal Component Analysis in Feature Extraction," in Mechatronics and Automation, 2007. ICMA 2007. International Conference on, 2007, pp. 3128-3132.
    [26]M. Palmer. (2013). Available: http://ordination.okstate.edu/PCA.htm
    [27]N. Gaitani, C. Lehmann, M. Santamouris, G. Mihalakakou, and P. Patargias, "Using principal component and cluster analysis in the heating evaluation of the school building sector," Applied Energy, vol. 87, pp. 2079-2086, 6// 2010.
    [28]L. I. Smith. (2002). A tutorial on Principal Components Analysis. Available: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
    [29]N. K. a. K. T. Visalakshi, "Impact of Nomalization in Distributed K-Means Clusteing," International Journal of Soft Computing, pp. 4: 168-172, 2009.
    [30]J. M. Quick. (2013). R Tutorial Series: Centering Variables and Generating Z-Scores with the Scale() Function.
    [31]R. D. Team. (2013). Documentation for package ‘base’ version 2.15.3. Available: http://stat.ethz.ch/R-manual/R-patched/library/base/html/scale.html
    [32]K. Singhs. (2008, 8 DEC 2013). Eigenvalues and Eigenvectors. Available: http://mathsforall.co.uk/userfiles/section%207a%2031st%20march%2008.doc
    [33]R. Andreani. (2013, 8 DEC 2013). Eigenvalues and Eigenvectors. Available: http://www.ime.unicamp.br/~andreani/matrizes/capitulo7.pdf
    [34]Applied multivariate statistical analysis: Prentice-Hall, Inc., 1988.
    [35]J. Xue, C. Lee, S. G. Wakeham, and R. A. Armstrong, "Using principal components analysis (PCA) with cluster analysis to study the organic geochemistry of sinking particles in the ocean," Organic Geochemistry, vol. 42, pp. 356-367, 5// 2011.
    [36]P. D. John R. Huguenard. Error Sum of Squares (SSE). Available: http://hlab.stanford.edu/
    [37]P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition): Addison-Wesley Longman Publishing Co., Inc., 2005.
