簡易檢索 / 詳目顯示

研究生: 朱俊諺
Chun-Yen Chu
論文名稱: 利用並行化簡方式改善近鄰傳遞分群演算法
An improved affinity propagation clustering algorithm for map/reduce environment
指導教授: 吳怡樂
Yi-Leh Wu
口試委員: 陳建中
Jiann-Jone Chen
唐政元
Cheng-Yang Tang
何瑁鎧
Maw-Kae Hor
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2013
畢業學年度: 101
語文別: 英文
論文頁數: 34
中文關鍵詞: Affinity PropagationMap/ReduceHadoopK-means分群演算法
外文關鍵詞: Affinity propagation, Map/Reduce, Hadoop, K-means, clustering algorithm
相關次數: 點閱:863下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 由於網路越趨發達,人們所需要處理的資料量也日趨龐大,而處理巨量資料所需要的成本十分龐大,故如何能以較小的代價處理巨量資料即成為一個已被認真看待的問題。Affinity Propagation演算法的特點為在進行分群時,並不用事先知道分群數量。演算法會分析資料內容找出適當的分群數進行分群。在本篇論文中我們利用Map/Reduce的方式改良了Affinity Propagation,利用多台電腦同時運算,以較小的代價及較快的時間完成巨量資料的分群。

    在實驗設計方面,我們先設計並完成了Map/Reduce Affinity Propagation的架構,進行資料分群的實驗,並將實驗結果與K-means比較。實驗結果顯示在巨量資料的分群上,Map/Reduce Affinity Propagation在準確度以及Davies–Bouldin index(DB Index)上都能獲得不錯的結。除此之外,我們將Map/Reduce Affinity Propagation的結果作為K-means的初始群集,也能有效的降低K-means計算所需的迭代次數。而運行時間上,當資料量越龐大時Affinity Propagation的優勢也會越加明顯。


    The Affinity Propagation (AP) is a clustering algorithm that does not require pre-set K cluster numbers. The AP works through message passing between nodes in the data and is more suitable for the data which is not well known. In this paper, we improve the original AP to Map/Reduce Affinity Propagation (MRAP) implemented in Hadoop, a distribute cloud environment. The architecture of MRAP is divided to multiple mappers and one reducer in Hadoop. With multiple processing nodes, we can make data processing faster and more efficient.
    In the experiments, we compare the clustering result of the proposed MRAP with the K-means method. The experiment results support that the proposed MRAP method has good performance in terms of accuracy and Davies–Bouldin index value. Also, by applying the proposed MRAP method can reduce the number of iterations before convergence for the K-means method irrespective to the data dimensions.

    論文摘要 I ABSTRACT II CONTENTS III LIST OF FIGURES IV LIST OF TABLES V CHAPTER 1. INTRODUCTION 1 CHAPTER 2. SYSTEM ARCHITECTURE 4 2.1 APACHE HADOOP 4 2.2 APACHE MAHOUT 6 2.2.1 K-means in Apache Mahout 7 CHAPTER 3. ALGORITHM DESIGN 9 3.1 AFFINITY PROPAGATION 9 3.2 MAP/REDUCE AFFINITY PROPAGATION 12 3.3 EVALUATIONS 16 CHAPTER 4. EXPERIMENT 20 4.1 RUNNING TIME TEST 21 4.2 DAVIES–BOULDIN INDEX TEST 22 4.3 ACCURACY TEST 25 4.4 K-MEANS INITIALIZATION TEST 27 4.5 DISCUSSION OF EXPERIMENTS 30 CHAPTER 5. CONCLUSIONS 31 REFERENCES 32

    [1] B. J. Frey and D. Dueck, “Clustering by Passing Messages Between Data Points”, Science 315, Pages 972–976, February 2007
    [2] I. E. Givoni and B. J. Frey, “A Binary Variable Model for Affinity Propagation ”, Neural Computation,Vol 21, issue 6, Pages 1589-1600, June 2009
    [3] C.D. Wang, J.H. Lai, C.Y. Suen, and J.Y. Zhu, “Multi-Exemplar Affinity Propagation”, Pattern Analysis and Machine Intelligence, IEEE Transactions, January 2013
    [4] Y.C. He, Q.C. Chen, X.L. Wang, R.F. Xu, X.H. Bai, and X.J. Meng, “An adaptive affinity propagation document clustering”, Informatics and Systems (INFOS), 2010 The 7th International Conference, Pages 1-7, March 2010
    [5] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, Communications of the ACM - 50th anniversary issue: 1958 - 2008, Volume 51 Issue 1, Pages 107-113, January 2008
    [6] Z.Q. Xu and D.W. Zhao, “Research on Clustering Algorithm for Massive Data Based on Hadoop Platform”, Computer Science & Service System (CSSS), Pages 43-45, Aug. 2012
    [7] M. A. Bhandarkar, “MapReduce programming with apache Hadoop”, International Parallel and Distributed Processing Symposium/International Parallel Processing Symposium - IPDPS(IPPS), Pages 19-23, April 2010
    [8] Y. Lai, “An Efficient Data Mining Framework on Hadoop using Java Persistence API”, Computer and Information Technology (CIT), July 2010.
    [9] M. Maurya and S. Mahajan, “Performance analysis of MapReduce programs on Hadoop cluster”, Information and Communication Technologies (WICT), Pages 505-510, 2012
    [10] R.M. Esteves, P. Rui, and C. Rong, “K-means Clustering in the Cloud -- A Mahout Test”, Advanced Information Networking and Applications (WAINA), Pages 514-519, March 2013
    [11] D. L. Davies, and D. W. Bouldin, “A Cluster Separation Measure”, Pattern Analysis and Machine Intelligence, Issue: 2, Pages 224-227, April 1979
    [12] J.A. Hartigan and M.A. Wong, “A K-Means Clustering Algorithm”, pplied Statistics, Vol. 28, No. 1, Pages 100-108. 1979
    [13] T. Kanungo, D. M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman and A.Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation”, Pattern Analysis and Machine Intelligence 24, Pages 881–892. 2002

    [14] D. Borthaku, “The hadoop distributed file system: Architecture and design”, The Apache Software Foundation, Hadoop Project Website, 2007
    [15] D. L. Olson, “Advanced Data Mining Techniques 1st edition”, February 2008
    [16] A.M. Fahim, A.M. Salem, F.A. Torkey, M.A. Ramadan and G. Saake , “An Efficient K-Means with Good Initial Starting Points”, Georgian Electronic Scientific Journal: Computer Science and Telecommunications, Vol. 2, No. 19, 2009
    [17] S. Babu, “Towards automatic optimization of MapReduce programs”, SoCC Proceedings of the 1st ACM symposium on Cloud computing, Pages 137-142. 2010
    [18] Y. Zhu, J. Yu, C.Y. Jia, “Initializing K-means Clustering Using Affinity Propagation”, Ninth International Conference on Hybrid Intelligent Systems, August 2010
    [19] “UCI Machine Learning Repository”, http://archive.ics.uci.edu/ml/, referenced on March 1st, 2013
    [20] “Statlog Datasets: comparison of results”, http://www.is.umk.pl/projects/datasets-stat.html, referenced on March 1st, 2013
    [21] “The Yale Face Database”, http://cvc.yale.edu/projects/yalefaces/yalefaces.html , referenced on March 1st, 2013
    [22] J. Zhang, G.Q. Wu, H.G Li, X.G. Hu, X.D. Wu, “A 2-Tier Clustering Algorithm with Map-Reduce”, ChinaGrid Conference Fifth Annual, July 2010
    [23] A. Radwan, “MapReduce 2.0 in Apache Hadoop 0.23” , http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/, referenced on May 1st, 2013

    QR CODE