簡易檢索 / 詳目顯示

研究生: 林義昌
I-Chang Lin
論文名稱: 基於雲端運算的高效率DBSCAN分群演算法
An Efficient DBSCAN Clustering Algorithm Based on Cloud Computing
指導教授: 戴碧如
Bi-Ru Dai
口試委員: 吳怡樂
Yi-Leh Wu
蔡曉萍
Hsiao-Ping Tsai
黃仁暐
Jen-Wei Huang
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2012
畢業學年度: 100
語文別: 英文
論文頁數: 44
中文關鍵詞: 資料探勘資料分群雲端運算DBSCAN資料分割MAP/REDUCEHADOOP
外文關鍵詞: data mining, data clustering, DBSCAN, Map/Reduce, data partition, Hadoop, cloud computing
相關次數: 點閱:206下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • DBSCAN是一個好的以密度為基礎的分群演算法,他可以在任意形狀中辨識出群聚且處理有雜訊的資料集。然而,隨著資料量不斷的增加,DBSCAN演算法在單機執行,有著延展性的問題。在這篇論文,我們推薦一個以Map/Reduce為基礎的DBACAN演算法叫做DBSCAN-MR來解決演展性的問題。在DBSCAN-MR,輸入資料集合會被分割成較小的部分然後在Hadoop平台上平行處理。然而,選擇不同的資料切割機制將會影響執行的效率和每個節點的負載平衡。因此,我們推薦一個方法,靠著減少邊緣點的切割(PRBP),選擇一個以資料分佈邊緣為基礎的資料切割方法。我們的實驗結果顯示,DBSCAN-MR結合PRBP的設計相較於比較對象是有較高的效率且更具延展性。


    DBSCAN is a well-known algorithm for density-based clustering because it can identify the groups of arbitrary shapes and deal with noisy datasets. However, with the increasing amount of data, DBSCAN algorithm running on a single machine has to face the scalability problem. In this paper, we propose a Map/Reduce-based DBSCAN algorithm called DBSCAN-MR to solve the scalability problem. In DBSCAN-MR, the input dataset is partitioned into smaller parts and then parallel processed on the Hadoop platform. However, choosing different partition mechanisms will affect the execution efficiency and load balance of each node. Therefore, we propose a method, partition with reduce boundary points (PRBP), to select partition boundaries based on the distribution of data points. Our experimental results show that DBSCAN-MR with the design of PRBP has higher efficiency and scalability than competitors.

    Abstract 論文摘要 致 謝 Table of Contents List of Figures List of Tables 1. Introduction 1.1 Background 1.2 Motivation and Contribution 1.3 Thesis Organization 2. Related Works 2.1 Density-Based Clustering 2.2 DBSCAN 2.3 Hadoop Map/Reduce 3. Distributed Density-Based Clustering with MapReduce 3.1 Partition with Reduced Boundary Points 3.2 DBSCAN-Map 3.3 DBSCAN-Reduce 3.4 Merge Boundary Result 3.5 Relabel Data Points 3.6 Data Sampling Partition 4 Experiment Study 4.1 Experimental Designs 4.2 Experimental Results 5 Conclusion and Future Works Reference

    1. Fayyad, Usama; Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). "From Data Mining to Knowledge Discovery in Databases"
    2. Han, P.N., Kamber, M.: Data Mining: Concepts and Techniques, 2ed(2006)
    3. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining (2006)
    4. J. Han, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2001, pp. 335–391
    5. Jen-Wei Huang, Su-Chen Lin and Ming-Syan Chen, “DPSP: Distributed Progressive Sequential Pattern Mining on the Cloud,” Lecture Notes in Computer Science, 2010, Volume 6119/2010, 27-34
    6. C. Moretti, J. Bulosan, D. Thain, and P. Flynn. All-pairs: An abstraction for data-intensive cloud computing. In IEEE/ACM International Parallel and Distributed Processing Symposium, April 2008
    7. WHITE, B., YEH, T., LIN, J., AND DAVIS, L. 2010. Web-scale computer vision using mapreduce for multimedia data mining. Proceedings of the Tenth International Workshop on Multimedia Data Mining, 1–10
    8. Y. Kwon, D. Nunley, J.P. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering algorithm for N-body simulations in a shared-nothing cluster. In Proc of 22nd SSDBM, 2010.
    9. M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231
    10. Kryszkiewicz, M., Lasek, P.: TI-DBSCAN: Clustering with DBSCAN by Means of the Triangle Inequality. In: Szczuka, M. (ed.) RSCTC 2010. LNCS, vol. 6086, pp. 60–69. Springer, Heidelberg (2010)
    11. EL-SONBATY, Y., ISMAIL, M. A., AND FAROUK, M. 2004. An efficient density-based clustering algorithm for large databases. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’04). IEEE Computer Society, 673–677
    12. R.T. Ng and J. Han.:” Efficient and Effective Clustering Methods for spatial data mining”, Proc. 20th Int. Conf. on Very Large Data Bases, santiago, Chile, 1994, pages 144-155
    13. Mahran, Shaaban, and Khaled Mahar, 2008: Using Grid for Accelerating Desnity-Based Clustering. 8th IEEE International Conference on Computer and Information Technology, pp. 35-40.
    14. Dean, J., Ghemawat, S.: Mapreduce: Simplified dataprocessing on large clusters. In: Symp. on Operating System Design and Implementation (2004)
    15. Hadoop, http://hadoop.apache.org
    16. M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60
    17. A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of 4th International Conference on Knowledge Discovery and Data Mining, New York City, NY, 1998, pp. 58–65
    18. S. Ma, T.J. Wang, S.W. Tang, D.Q. Yang, J. Gao, A new fast clustering algorithm based on reference and density, in: Proceedings of WAIM, Lectures Notes in Computer Science, 2762, Springer, 2003, pp. 214–225
    19. Z. Aoying, Z. Shuigeng, Approaches for scaling DBSCAN algorithm to large spatial database, Journal of Computer Science and Technology 15 (6) (2000) 509–526.
    20. M. Ester, H.-P. Kriegel, J. Sander, X. Xu, Clustering for mining in large spatial databases, KI-Journal (Artificial Intelligence) 12 (1) (1998) 18–24, Special Issue on Data Mining.
    21. Cheng, H., Tan, P.-N., Sticklen, J., Punch, W.F.: Recommendation via query cen-tered random walk on k-partite graph. In: Proc. of Intl. Conf. on Data Mining, pp. 457–462 (2007)
    22. Chilson, J., Ng, R., Wagner, A., Zamar, R.: Parallel computation of high dimen-sional robust correlation and covariance matrices. In: Proc. of Intl. Conf. on Knowl-edge Discovery and Data Mining, August 2004, pp. 533–538 (2004)
    23. Kargupta, H., Das, K., Liu, K.: Multi-party, privacy-preserving distributed data mining using a game theoretic framework. In: Proc. of European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp. 523–531 (2007)
    24. Cheng, H., Tan, P.-N., Sticklen, J., Punch, W.F.: Recommendation via query cen-tered random walk on k-partite graph. In: Proc. of Intl. Conf. on Data Mining, pp. 457–462 (2007)
    25. Luo, P., Xiong, H., Lu, K., Shi, Z.: Distributed classification in peer-to-peer networks. In: Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, pp. 968–976 (2007)
    26. Wolff, R., Bhaduri, K., Kargupta, H.: A generic local algorithm for mining data streams in large distributed systems. IEEE Trans. on Knowledge and Data Engi-neering 21(4), 465–478 (2009)
    27. Xu, X., Yuruk, N., Feng, Z., Schweiger, T.A.J.: Scan: A structural clustering algorithm for networks. In: Proc. of Intl. Conf. on Knowledge Discovery and DataMining, pp. 824–833 (2007)
    28. Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, Jianping Fan, "MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce," icpads, pp.473-480, 2011 IEEE 17th International Conference on Parallel and Distributed Systems, 2011
    29. V. Gaede and O. G¨unther, “Multidimensional access methods,” ACM Comput. Surv., vol. 30, no. 2, pp. 170–231, 1998
    30. G. Karypis, E. H. Han, and V. Kumar, “CHAMELEON: A hierarchical clustering algorithm using dynamic modeling,” Computer, vol. 32, no. 8, pp. 68–75, 1999.
    31. USGS geoname datasets; http://geonames.usgs.gov/

    無法下載圖示 全文公開日期 2017/07/31 (校內網路)
    全文公開日期 本全文未授權公開 (校外網路)
    全文公開日期 本全文未授權公開 (國家圖書館:臺灣博碩士論文系統)
    QR CODE