研究生: |
陳旭洹 Syu-Huan Chen |
---|---|
論文名稱: |
在MapReduce中使用壓縮合併字典樹的均衡分群機制之研究 A Balanced Partitioning Mechanism with Condensed, Collapsed Trie in MapReduce |
指導教授: |
陳省隆
Hsing-Lung Chen |
口試委員: |
呂政修
Jenq-Shiou Leu 陳郁堂 Yie-Tarng Chen 莊博任 Po-jen Chuang |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 電子工程系 Department of Electronic and Computer Engineering |
論文出版年: | 2018 |
畢業學年度: | 106 |
語文別: | 中文 |
論文頁數: | 60 |
中文關鍵詞: | 大數據 、偏態分佈 、Hadoop 、負載平衡 、MapReduce |
外文關鍵詞: | big data, data skew, Hadoop, load balance, MapReduce |
相關次數: | 點閱:724 下載:1 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
MapReduce 最近已成為處理大數據的高效率平台,經由消除資料關
聯性,然後將工作量分送至 reducers 做平行處理達成其目標,齊夫定律
說明,對於許多自然及社會科學所研究的大數據,如果把字詞按照使用
頻率由大到小排序,那麼使用頻率與序號之間幾乎恰好成反比,也就是
資料呈現偏態分佈的情況。然而,對於偏態分佈的資料,使用
MapReduce 的雜湊函數通常產生不均勻的工作量給 reducers,大大降低
MapReduce 的效能,因此發展一個均衡分配工作量的分群演算法是一
個值得探討的主題。
此論文研究旨在提出一個使用壓縮合併字典樹的均衡分群機制以
均勻分配工作量至 reducers。首先提出壓縮字典樹以確實掌握資料統計,
並使用合併字典樹有效限制記憶體使用量,然後使用準最佳裝箱演算
法將子群均勻分配至 reducers,以降低總執行時間。所提出的分群機制
只需要合理的記憶體使用量及少量額外的計算量。我們將對數個實際
大數據實作反向索引,以評估我們所提出分群機制的效能。
The MapReduce has emerged as an efficient platform for coping with
big data. It achieves this goal by decoupling the data and then distributing
the workloads to multiple reducers for processing in a fully parallel manner.
The hash function of MapReduce usually generates the unbalanced
workloads to multiple reducers for the skewed data. The unbalanced
workloads to multiple reducers lead to degrading the performance of
MapReduce significantly, because the overall running time of a map-reduce
cycle is determined by the longest running reducer. Thus, it is an important
issue to develop a balanced partitioning algorithm which partitions the
workloads evenly for all the reducers.
The aim of this proposal is to propose a balanced partitioning
mechanism with condensed trie in mapreduce, which evenly distributes the
data to the reducers. Then, we propose a quasi-optimal packing algorithm to
assign sub-partitions to the reducers evenly, resulting in reducing the total
execution time. The proposed partitioning mechanism requires a reasonable
amount of memory usage and incurs a small running overhead. The
experiments using Inverted Indexing on several real-world datasets are
conducted to evaluate the performance of our proposed partitioning
mechanism.
[1] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," Communications of the Acm, vol. 51, pp. 107-113, Jan 2008.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, 2010, pp. 1-10.
[3] D. M. Powers, "Applications and explanations of Zipf's law," in Proceedings of the joint conferences on new methods in language processing and computational natural language learning, 1998, pp. 151-160.
[4] P. Dhawalia, S. Kailasam, and D. Janakiram, "Chisel: A resource savvy approach for handling skew in mapreduce applications," in Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on, 2013, pp. 652-660.
[5] V. Kumaresan and R. Baskaran, "AEGEUS: An online partition skew mitigation algorithm for mapreduce," in Proceedings of the International Conference on Informatics and Analytics, 2016, p. 100.
[6] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Load balancing in mapreduce based on scalable cardinality estimates," in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 2012, pp. 522-533.
[7] S. Mohanapriya and P. Natesan, "A micropartitioning technique for massive data analysis using MapReduce," in Information Communication and Embedded Systems (ICICES), 2014 International Conference on, 2014, pp. 1-5.
[8] J. Myung, J. Shim, J. Yeon, and S.-g. Lee, "Handling data skew in join algorithms using MapReduce," Expert Systems with Applications, vol. 51, pp. 286-299, 2016.
[9] K. Slagter, C.-H. Hsu, Y.-C. Chung, and D. Zhang, "An improved partitioning mechanism for optimizing massive data analysis using MapReduce," Journal of Supercomputing, vol. 66, 2013.
[10] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in MapReduce for skewed data," in ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh, 2012, pp. 1-8.
[11] W. Yan, Y. Xue, and B. Malin, "Scalable and robust key group size estimation for reducer load balancing in MapReduce," in Big Data, 2013 IEEE International Conference on, 2013, pp. 156-162.
[12] W. Yan, Y. Xue, and B. Malin, "Scalable load balancing for mapreduce-based record linkage," in Performance Computing and Communications Conference (IPCCC), 2013 IEEE 32nd International, 2013, pp. 1-10.
[13] F. Atta, S. D. Viglas, and S. Niazi, "SAND Join—A skew handling join algorithm for Google's MapReduce framework," in Multitopic Conference (INMIC), 2011 IEEE 14th International, 2011, pp. 170-175.
[14] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., "Apache hadoop yarn: Yet another resource negotiator," in Proceedings of the 4th annual Symposium on Cloud Computing, 2013, p. 5.