
Author: 陳子奕 (Zih-Yi Chen)
Thesis title: 應用於缺失資料集的漸進式屬性優先K-Means分群演算法 (A Novel Component Priority-Based and Incremental K-Means Clustering Algorithm for Imputing Incomplete Data)
Advisor: 鍾國亮 (Kuo-Liang Chung)
Oral defense committee: 蔡文祥 (Wen-Hsiang Tsai), 李同益 (Tong-Yee Lee), 花凱龍 (Kai-Lung Hua), 賴祐吉 (Yu-Chi Lai)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: English
Number of pages: 41
Keywords (Chinese): 準確度, 分群, 最大期望算法, 缺失數據插補, 漸進式插補, K-means
Keywords (English): Accuracy, Clustering, Expectation maximization, Incomplete data imputation, Incremental imputation, K-means
    Abstract: To address the problem of imputing incomplete data, this thesis proposes a novel component priority-based and incremental K-means (CPIK-means) algorithm. The algorithm first imputes the missing attributes in the first-priority component using a geometry-based correlation strategy. K-means is then applied to partition the dataset, based on this imputed component together with the existing complete components. Within each cluster, every missing attribute in the second-priority component is imputed by the expectation-maximization (EM) method, and the previously imputed first-priority component is updated at the same time. In the same way, using the current complete components (both the imputed ones and the originally complete ones), the clustering-imputation-update process is repeated until all incomplete components have been imputed. Detailed experimental results show that the proposed CPIK-means algorithm outperforms state-of-the-art algorithms.
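    The clustering-imputation-update loop described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the record does not give the priority rule, the geometry-based correlation strategy, or the EM details, so the sketch makes loudly simplified assumptions. Components (columns) with fewer missing values are assumed to have higher priority, the first-priority component is filled with the global mean of its observed values as a stand-in for the geometry-based strategy, and per-cluster means of observed values stand in for EM. The names `kmeans`, `cluster_mean_fill`, and `incremental_impute` are hypothetical, and every column is assumed to contain at least one observed value.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_mean_fill(data, col, rows_to_fill, labels, observed_rows):
    """Stand-in for EM (assumption): fill data[i][col] with the mean of the
    originally observed values of `col` in the same cluster as row i,
    falling back to the global observed mean for clusters without any."""
    obs_vals = [data[t][col] for t in observed_rows]
    fallback = sum(obs_vals) / len(obs_vals)
    for i in rows_to_fill:
        same = [data[t][col] for t in observed_rows if labels[t] == labels[i]]
        data[i][col] = sum(same) / len(same) if same else fallback


def incremental_impute(rows, k=2):
    """rows: list of equal-length lists, None marks a missing value.
    Returns a filled copy; the input is left unchanged."""
    data = [list(r) for r in rows]
    n, d = len(data), len(data[0])
    missing = {j: [i for i in range(n) if data[i][j] is None] for j in range(d)}
    observed = {j: [i for i in range(n) if data[i][j] is not None] for j in range(d)}
    usable = [j for j in range(d) if not missing[j]]  # complete components
    # Assumed priority rule: fewer missing values means higher priority.
    order = sorted((j for j in range(d) if missing[j]),
                   key=lambda j: len(missing[j]))
    if not order:
        return data
    # First-priority component: global mean (one cluster for all rows) as a
    # stand-in for the geometry-based correlation strategy.
    first = order[0]
    cluster_mean_fill(data, first, missing[first], [0] * n, observed[first])
    usable.append(first)
    done = [first]
    for j in order[1:]:
        # Cluster on every component that is complete so far.
        labels = kmeans([[data[i][c] for c in usable] for i in range(n)], k)
        cluster_mean_fill(data, j, missing[j], labels, observed[j])
        # Update step: refresh earlier imputations with the new clusters.
        for p in done:
            cluster_mean_fill(data, p, missing[p], labels, observed[p])
        usable.append(j)
        done.append(j)
    return data


if __name__ == "__main__":
    sample = [
        [0.0, 1.0, 10.0],
        [0.1, 1.0, 10.0],
        [0.2, None, None],
        [10.0, 5.0, 50.0],
        [10.1, 5.0, None],
        [10.2, 5.0, 50.0],
    ]
    for row in incremental_impute(sample, k=2):
        print(row)
```

    The design point the sketch tries to convey is that each newly imputed component enlarges the set of components available for the next clustering round, and earlier imputations are revisited once better clusters are available.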

    Table of contents:
    Abstract in Chinese
    Abstract in English
    Contents
    List of Figures
    List of Tables
    List of Algorithms
    1 Introduction
      1.1 Related works
      1.2 Contributions
    2 The proposed CPIK-means incomplete data imputation algorithm
    3 Experimental Results
      3.1 Accuracy (ACC) comparison
      3.2 The F-score comparison
      3.3 The normalized mutual information (NMI) performance comparison
    4 Conclusion
    Reference


    Full-text release date: 2025/06/29 (campus network, off-campus network, and the National Central Library's Taiwan thesis and dissertation system)