
Author: 彭宥學 (Yu-Hsueh Peng)
Thesis Title: 基於神經網路正切核之譜分群 (Spectral Clustering based on Neural Tangent Kernel)
Advisor: 徐俊傑 (Chiun-Chieh Hsu)
Committee: 王有禮 (Yue-Li Wang), 賴源正 (Yuan-Cheng Lai)
Degree: Master (碩士)
Department: 管理學院 - 資訊管理系 (Department of Information Management, College of Management)
Thesis Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 79
Keywords (in Chinese): 譜分群、神經網路正切核、神經網路、特徵向量、特徵分解
Keywords (in other languages): Spectral clustering, Neural tangent kernel, Neural network, Eigenvector, Eigen decomposition
Reference times: Clicks: 472, Downloads: 0
    With the rapid development and popularization of the internet, everyone can now be a content creator, which produces massive amounts of data; effectively distinguishing these data has therefore become an important problem. Data mining is applied to this problem because it helps users analyze large amounts of information and extract useful intelligence. Clustering is an important data-mining method that aims to distinguish data without using any labels, so that data within the same cluster are as similar as possible and data in different clusters are as dissimilar as possible.

    Many scholars have proposed neural-network-based algorithms for distinguishing data. One such approach is semi-supervised learning: a small amount of data is manually labelled to form a labelled set for training a neural network; the trained network then assigns pseudo labels to the unlabelled data, and the network is retrained repeatedly on both the labelled set and the pseudo labels, after which it can be used to distinguish data. To turn this kind of algorithm into a clustering algorithm, one must consider how a neural network can cluster data when the labels are removed.

    This study proposes Spectral Clustering based on Neural Tangent Kernel (SCNTK) to address this problem. SCNTK uses an untrained, randomly initialized neural network, without requiring any data labels, to compute the similarity matrix needed for spectral clustering; eigen decomposition of this similarity matrix yields the eigenvectors, and the data points are then clustered by applying the k-means algorithm to those eigenvectors. Experimental results show that the proposed algorithm effectively solves the above problem and outperforms other recent clustering algorithms.


    In the modern age, the internet has developed rapidly and become widespread. Because everyone now has the opportunity to become a content creator, a large amount of data is being created, and distinguishing large amounts of data efficiently has become an important issue. Data mining offers a solution, since it can help users analyze enormous amounts of information and extract useful intelligence. Clustering is an important data-mining method that aims to distinguish data without using any labels: data in the same cluster should be as similar as possible, and data in different clusters should be as dissimilar as possible.

    Many scholars have proposed algorithms that cluster data using neural networks. One neural-network-based approach is the semi-supervised learning algorithm, which manually labels a small amount of data to form a labelled dataset for training. The trained neural network is then used to assign pseudo labels to the unlabeled data, and the network is retrained repeatedly on both the labelled dataset and the pseudo labels. Afterward, the trained network can be used to distinguish data. However, to turn this kind of algorithm into a clustering algorithm, it is necessary to consider how a neural network can cluster the data when the labels are removed.
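The pseudo-labelling loop described above can be sketched as follows. This is a minimal illustration, not code from the thesis: a logistic-regression classifier stands in for the neural network, and the toy dataset, class count, and iteration count are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: three well-separated clusters stand in for a real dataset.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# A small manually labelled set: five points per class.
labeled = np.concatenate([np.flatnonzero(y == c)[:5] for c in range(3)])
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

# Train an initial classifier on the labelled set only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X[labeled], y[labeled])

# Repeatedly pseudo-label the unlabelled data and retrain on both sets.
for _ in range(5):
    pseudo = clf.predict(X[unlabeled])
    X_all = np.vstack([X[labeled], X[unlabeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    clf.fit(X_all, y_all)

accuracy = clf.score(X, y)  # how well the final model distinguishes the data
```

As the abstract notes, this scheme still depends on the initial hand-labelled subset; removing it entirely is what motivates the clustering formulation.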

    In this study, Spectral Clustering based on Neural Tangent Kernel (SCNTK) is proposed to solve the problems mentioned above. SCNTK uses an untrained, randomly initialized neural network, without any data labels, to compute the similarity matrix required for spectral clustering. After that, the eigen decomposition of the similarity matrix is carried out to obtain the eigenvectors, and the data points are then clustered by applying the k-means method to those eigenvectors. According to the experimental results, the method proposed in this study can effectively solve the problems listed above, and its performance is also better than that of other recently developed clustering methods.
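The spectral-clustering pipeline that SCNTK builds on (similarity matrix, eigen decomposition, then k-means on the eigenvectors) can be sketched as follows. This is a generic illustration under stated assumptions, not the thesis's implementation: an RBF kernel (one of the similarity measures surveyed in Chapter 2) stands in for the NTK-based similarity, and the symmetric normalized Laplacian follows the standard spectral-clustering formulation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
k = 3

# 1) Similarity (affinity) matrix; an RBF kernel stands in for the NTK.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / 2.0)

# 2) Symmetric normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
L = np.eye(len(X)) - W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]

# 3) Eigen decomposition; keep the eigenvectors of the k smallest
#    eigenvalues (eigh returns eigenvalues in ascending order).
vals, vecs = np.linalg.eigh(L)
U = vecs[:, :k]
U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize the embedding

# 4) Run k-means on the embedded points to obtain cluster labels.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
ari = adjusted_rand_score(y, labels)
```

SCNTK's contribution is in step 1: replacing a hand-chosen kernel with similarities derived from a random, untrained network via the neural tangent kernel, so no labels or training are needed.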

    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Thesis Organization
    Chapter 2: Literature Review
      2.1 Similarity Measures
        2.1.1 Euclidean Distance
        2.1.2 Cosine Similarity
        2.1.3 Radial Basis Function Kernel
      2.2 Clustering Algorithms
        2.2.1 K-means
        2.2.2 Fuzzy C-Means
        2.2.3 DBSCAN
        2.2.4 OPTICS
        2.2.5 Hierarchical Clustering
        2.2.6 Spectral Clustering
      2.3 Literature on Similarity-Matrix Construction
        2.3.1 The LSC Algorithm
        2.3.2 The U-SPEC Algorithm
      2.4 Neural Tangent Kernel
    Chapter 3: Spectral Clustering based on Neural Tangent Kernel
      3.1 Overall Architecture of SCNTK
      3.2 Data Preprocessing
      3.3 Representative-Point Selection by Random Selection Combined with K-means
      3.4 Similarity-Matrix Construction
      3.5 Eigen Decomposition and Clustering
      3.6 Representative-Point Selection by Random Selection Combined with Spectral Clustering
      3.7 The SCNTK Algorithms
        3.7.1 SCNTK-KM
        3.7.2 SCNTK-SC
      3.8 Time Complexity of SCNTK
    Chapter 4: Experimental Results and Parameter Analysis
      4.1 Experimental Environment and Datasets
      4.2 Compared Algorithms and Evaluation Metrics
        4.2.1 Overview of Compared Algorithms
        4.2.2 Evaluation Metrics
      4.3 Experimental Analysis and Results
        4.3.1 Comparison on the USPS Dataset
        4.3.2 Comparison on the PenDigits Dataset
        4.3.3 Comparison on the Letter Dataset
        4.3.4 Comparison on the MNIST Dataset
        4.3.5 Comparison on the Covertype Dataset
        4.3.6 Summary of Comparison Results
      4.4 Parameter Analysis of SCNTK
        4.4.1 Analysis of the Representative-to-Candidate Point Ratio
        4.4.2 Comparison of the Parameter w
        4.4.3 Comparison of Similarity-Matrix Adjustments
        4.4.4 Recommended Parameter Settings
      4.5 Time Complexity Comparison
    Chapter 5: Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
    References


    Full text public date: 2027/06/10 (Intranet public)
    Full text public date: This full text is not authorized to be published. (Internet public)
    Full text public date: This full text is not authorized to be published. (National library)