在MapReduce中使用壓縮合併字典樹的均衡分群機制之研究

簡易檢索 / 詳目顯示

回結果列表

研究生：	陳旭洹 Syu-Huan Chen
論文名稱：	在MapReduce中使用壓縮合併字典樹的均衡分群機制之研究 A Balanced Partitioning Mechanism with Condensed, Collapsed Trie in MapReduce
指導教授：	陳省隆 Hsing-Lung Chen
口試委員:	呂政修 Jenq-Shiou Leu 陳郁堂 Yie-Tarng Chen 莊博任 Po-jen Chuang
學位類別：	碩士 Master
系所名稱：	電資學院 - 電子工程系 Department of Electronic and Computer Engineering
論文出版年：	2018
畢業學年度：	106
語文別：	中文
論文頁數：	60
中文關鍵詞：	大數據、偏態分佈、Hadoop 、負載平衡、MapReduce
外文關鍵詞：	big data, data skew, Hadoop, load balance, MapReduce
相關次數：	點閱：724 下載：1
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

MapReduce 最近已成為處理大數據的高效率平台，經由消除資料關
聯性，然後將工作量分送至 reducers 做平行處理達成其目標，齊夫定律
說明，對於許多自然及社會科學所研究的大數據，如果把字詞按照使用
頻率由大到小排序，那麼使用頻率與序號之間幾乎恰好成反比，也就是
資料呈現偏態分佈的情況。然而，對於偏態分佈的資料，使用
MapReduce 的雜湊函數通常產生不均勻的工作量給 reducers，大大降低
MapReduce 的效能，因此發展一個均衡分配工作量的分群演算法是一
個值得探討的主題。
此論文研究旨在提出一個使用壓縮合併字典樹的均衡分群機制以
均勻分配工作量至 reducers。首先提出壓縮字典樹以確實掌握資料統計，
並使用合併字典樹有效限制記憶體使用量，然後使用準最佳裝箱演算
法將子群均勻分配至 reducers，以降低總執行時間。所提出的分群機制
只需要合理的記憶體使用量及少量額外的計算量。我們將對數個實際
大數據實作反向索引，以評估我們所提出分群機制的效能。

The MapReduce has emerged as an efficient platform for coping with
big data. It achieves this goal by decoupling the data and then distributing
the workloads to multiple reducers for processing in a fully parallel manner.
The hash function of MapReduce usually generates the unbalanced
workloads to multiple reducers for the skewed data. The unbalanced
workloads to multiple reducers lead to degrading the performance of
MapReduce significantly, because the overall running time of a map-reduce
cycle is determined by the longest running reducer. Thus, it is an important
issue to develop a balanced partitioning algorithm which partitions the
workloads evenly for all the reducers.
The aim of this proposal is to propose a balanced partitioning
mechanism with condensed trie in mapreduce, which evenly distributes the
data to the reducers. Then, we propose a quasi-optimal packing algorithm to
assign sub-partitions to the reducers evenly, resulting in reducing the total
execution time. The proposed partitioning mechanism requires a reasonable
amount of memory usage and incurs a small running overhead. The
experiments using Inverted Indexing on several real-world datasets are
conducted to evaluate the performance of our proposed partitioning
mechanism.

誌謝    2
摘要    3
Abstract    4
目錄    5
圖目錄    8
表目錄    10
第1章    導論    11
1    研究背景    11
2    研究動機與目的    12
2.1    研究動機    13
2.2    研究目的    14
3    後續章節內容    14
第2章    相關研究    15
1    沒做前置處理的方法    16
2    有做前置處理的方法    16
第3章    背景知識    22
1    Apache Hadoop    22
1.1    分散式檔案系統(HDFS)    22
1.2    MapReduce 平行運算架構    23
1.3    MapReduce執行過程    24
1.4    YARN    25
1.5    MapReduce、YARN、HDFS環境設定說明    26
1.6    MapReduce使用不同參數設定的影響    28
2    字典樹    29
2.1    字典樹的特性    30
2.2    字典樹的建立    30
第4章    壓縮合併字典樹與均衡分配機制    32
1    MapReduce專案應用介紹－反向索引    33
2    壓縮字典樹    34
3    合併字典樹    38
4    頻率統計    39
5    取樣    41
6    均衡分群機制    42
6.1    切割子切割塊    42
6.2    裝箱演算法    46
6.3    分群器    47
第5章    實驗環境與實驗結果    49
1    實驗環境    49
2    壓縮合併字典樹的效能評估    49
3    IR測試結果    50
4    反向索引測試結果    51
第6章    結論與未來展望    55
1    結論    55
2    未來展望    56
參考文獻    57


                                

[1] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," Communications of the Acm, vol. 51, pp. 107-113, Jan 2008.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, 2010, pp. 1-10.
[3] D. M. Powers, "Applications and explanations of Zipf's law," in Proceedings of the joint conferences on new methods in language processing and computational natural language learning, 1998, pp. 151-160.
[4] P. Dhawalia, S. Kailasam, and D. Janakiram, "Chisel: A resource savvy approach for handling skew in mapreduce applications," in Cloud Computing (CLOUD), 2013 IEEE Sixth International Conference on, 2013, pp. 652-660.
[5] V. Kumaresan and R. Baskaran, "AEGEUS: An online partition skew mitigation algorithm for mapreduce," in Proceedings of the International Conference on Informatics and Analytics, 2016, p. 100.
[6] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Load balancing in mapreduce based on scalable cardinality estimates," in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, 2012, pp. 522-533.
[7] S. Mohanapriya and P. Natesan, "A micropartitioning technique for massive data analysis using MapReduce," in Information Communication and Embedded Systems (ICICES), 2014 International Conference on, 2014, pp. 1-5.
[8] J. Myung, J. Shim, J. Yeon, and S.-g. Lee, "Handling data skew in join algorithms using MapReduce," Expert Systems with Applications, vol. 51, pp. 286-299, 2016.
[9] K. Slagter, C.-H. Hsu, Y.-C. Chung, and D. Zhang, "An improved partitioning mechanism for optimizing massive data analysis using MapReduce," Journal of Supercomputing, vol. 66, 2013.
[10] Y. Xu, P. Zou, W. Qu, Z. Li, K. Li, and X. Cui, "Sampling-based partitioning in MapReduce for skewed data," in ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh, 2012, pp. 1-8.
[11] W. Yan, Y. Xue, and B. Malin, "Scalable and robust key group size estimation for reducer load balancing in MapReduce," in Big Data, 2013 IEEE International Conference on, 2013, pp. 156-162.
[12] W. Yan, Y. Xue, and B. Malin, "Scalable load balancing for mapreduce-based record linkage," in Performance Computing and Communications Conference (IPCCC), 2013 IEEE 32nd International, 2013, pp. 1-10.
[13] F. Atta, S. D. Viglas, and S. Niazi, "SAND Join—A skew handling join algorithm for Google's MapReduce framework," in Multitopic Conference (INMIC), 2011 IEEE 14th International, 2011, pp. 170-175.
[14] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., "Apache hadoop yarn: Yet another resource negotiator," in Proceedings of the 4th annual Symposium on Cloud Computing, 2013, p. 5.

簡易檢索 / 詳目顯示

相關論文