簡易檢索 / 詳目顯示

研究生: 沈逸翔
Yi-Siang Shen
論文名稱: 於MapReduce中解決資料傾斜問題的均衡分群演算法
A Balanced-Partitioning Algorithm for Handling Data Skew in MapReduce
指導教授: 陳省隆
Hsing-Lung Chen
口試委員: 莊博任
Po-jen Chuang
吳乾彌
Chen-Mie Wu
呂政修
Jenq-Shiou Leu
學位類別: 碩士
Master
系所名稱: 電資學院 - 電子工程系
Department of Electronic and Computer Engineering
論文出版年: 2017
畢業學年度: 105
語文別: 中文
論文頁數: 75
中文關鍵詞: 平行處理資料傾斜負載平衡
外文關鍵詞: Data Skew
相關次數: 點閱:180下載:6
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  •   近年來隨著智慧型手機(Smart Phone)、個人電腦(PC)、網際網路(Internet)的蓬勃發展,每分每秒的資訊量都在持續增長,全球對於大數據分析、雲端運算的應用也是日益漸增,而Apache Hadoop就是其中一項處理大數據資料的平台。隨著人類生活方式的改變,我們也有越來越多元的資料可供分析,但這些資料很多都是有資料傾斜(Data Skew)的特性,而擁有資料傾斜特性的資料往往會造成在平行處理時的一個資料量分配不均(Load Imbalancing)問題,資料量如果分配不均可能導致分配量較多的Reducer執行時間較長而拉長整體的執行時間。
      因此,本論文目的為解決因為資料傾斜特性而導致的資料量分配不均問題。我們使用一個MapReduce程式(均衡分群演算法)來處理前置作業,其中的Mapper來統計每個Key的頻率(Frequency),之後Reducer合併統計的結果,再使用切割子切割塊(Sub-Partition)的方法與均衡分配(Bucket packing)方法將每個Key均衡分配到適當的Partition中,最後讓實際應用中Reducer分配到的資料量都可以很均勻,達到負載平衡(Load Balancing)目的。


      Recently, with the advance of smart phone and internet, the issue of big data become more popular, big data analysis and cloud computing become more popular too. Apache Hadoop is one of the big data analysis platform. With the change of people’s living style, data type become more complex, many data have characteristic of skew. In parallel processing, data with skew may lead to load imbalancing problems, load imbalancing may cause Reducer spend more times which has more loading.
      Therefore, the purpose of this paper is how to handling data skew in MapReduce achieves load balancing. We are using a MapReduce program (Balanced-Partitioning algorithm) to handling pre-processing. Mapper to counts frequency of every key, Reducer to merges all Mapper output, using our cut sub-partition and bucket-packing methods assigning every key to correct partition. Finaly, made the Reducer of actual application load balancing.

    誌謝 1 摘要 2 Abstract 3 目錄 4 圖目錄 7 表目錄 9 第1章 導論 11 1.1 研究背景 11 1.2 研究動機與目的 12 1.2.1 研究動機 12 1.2.2 研究目的 13 1.3 後續章節內容 13 第2章 相關研究 15 2.1 有做前置處理的方法 15 2.2 沒有做前置處理的方法 18 第3章 背景知識 20 3.1 MapReduce 20 3.1.1 YARN詳細執行過程 20 3.1.2 MapReduce詳細過程 22 3.1.3 MapReduce、YARN、HDFS參數說明 24 3.1.4 MapReduce使用不同參數設定的影響 26 3.2 Trie 32 3.2.1 Trie的特性 33 3.2.2 Trie的建立 33 第4章 均衡分群演算法 36 4.1 頻率統計 37 4.2 切割子切割塊方法 41 4.3 均衡分配方法 47 4.3.1 均衡分配方法名詞定義 47 4.3.2 建立MaxHeap 48 4.3.3 均衡分配方法詳細說明 48 4.3.4 Bucket packing problem 54 4.3.5 對應表(Mapping Table) 54 4.4 Partitioner 55 4.4.1 在Partitioner中建立Trie 55 4.4.2 分群(Partitioning) 56 第5章 實驗環境與實驗結果 57 5.1 實驗環境 57 5.2 Application介紹-Inverted Index 57 5.3 Imbalance Ratio測試結果 58 5.4 Application測試結果 63 5.4.1 均衡分群演算法花費時間 63 5.4.2 Inverted Index花費時間 64 第6章 結論與未來展望 68 6.1 結論 68 6.2 未來展望 69 參考文獻 71

    [1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.
    [2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, 2010, pp. 1-10.
    [3] D. M. Powers, "Applications and explanations of Zipf's law," in Proceedings of the joint conferences on new methods in language processing and computational natural language learning, 1998, pp. 151-160.
    [4] S. Mohanapriya and P. Natesan, "A micropartitioning technique for massive data analysis using MapReduce," in Information Communication and Embedded Systems (ICICES), 2014 International Conference on, 2014, pp. 1-5.
    [5] K. Slagter, C.-H. Hsu, Y.-C. Chung, and D. Zhang, "An improved partitioning mechanism for optimizing massive data analysis using MapReduce," Journal of Supercomputing, vol. 66, 2013.
    [6] W. Yan, Y. Xue, and B. Malin, "Scalable and robust key group size estimation for reducer load balancing in MapReduce," in Big Data, 2013 IEEE International Conference on, 2013, pp. 156-162.
    [7] W. Yan, Y. Xue, and B. Malin, "Scalable load balancing for mapreduce-based record linkage," in Performance Computing and Communications Conference (IPCCC), 2013 IEEE 32nd International, 2013, pp. 1-10.
    [8] J. Myung, J. Shim, J. Yeon, and S.-g. Lee, "Handling data skew in join algorithms using MapReduce," Expert Systems with Applications, vol. 51, pp. 286-299, 2016.
    [9] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Load balancing in mapreduce based on scalable cardinality estimates," in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 2012, pp. 522-533.
    [10] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in MapReduce for skewed data," in ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh, 2012, pp. 1-8.
    [11] F. Atta, S. D. Viglas, and S. Niazi, "SAND Join—A skew handling join algorithm for Google's MapReduce framework," in Multitopic Conference (INMIC), 2011 IEEE 14th International, 2011, pp. 170-175.
    [12] P. Dhawalia, S. Kailasam, and D. Janakiram, "Chisel: A resource savvy approach for handling skew in mapreduce applications," in Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on, 2013, pp. 652-660.
    [13] V. Kumaresan and R. Baskaran, "AEGEUS: An online partition skew mitigation algorithm for mapreduce," in Proceedings of the International Conference on Informatics and Analytics, 2016, p. 100.
    [14] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., "Apache hadoop yarn: Yet another resource negotiator," in Proceedings of the 4th annual Symposium on Cloud Computing, 2013, p. 5.

    QR CODE