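高效能異質性Hadoop架構 High-Performance Heterogeneous Hadoop Architecture | National Taiwan University of Science and Technology Theses and Dissertations System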

Graduate student:	楊明憲 Ming-hsien Yang
Thesis title:	高效能異質性Hadoop架構 High-Performance Heterogeneous Hadoop Architecture
Advisor:	徐勝均 Sendren Sheng-Dong Xu
Oral defense committee:	陳佳堃 Jia-kun Chen, 李俊賢 Jin-shyan Lee
Degree:	Master
Department:	College of Engineering - Graduate Institute of Automation and Control
Year of publication:	2013
Academic year of graduation:	101 (ROC calendar)
Language:	Chinese
Number of pages:	93
Keywords (Chinese):	Hadoop, HDFS, MapReduce
Keywords (English):	Hadoop, HDFS, MapReduce

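Native Hadoop uses a “two-layer” architecture made up of a master and slaves. The master is mainly responsible for managing the slaves, and each slave acts as both a data node (DataNode) and a compute node (TaskTracker). Because users can add slave nodes to improve the performance of parallel computation on massive data, Hadoop has long been regarded as a key technology for massive data processing. Although Hadoop improves performance, a Hadoop cluster built from a large number of slaves has a larger physical footprint and consumes more energy. This thesis therefore exploits the low power consumption, high performance on large-volume data processing, and small size of the ARM architecture, combining it with x86 to form a new “heterogeneous three-layer” Hadoop architecture, and introduces the concept of a “dynamic file block allocation algorithm” into task scheduling. This not only remedies the shortcomings of the original architecture but also shortens Map/Reduce computation time by more than 22%.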

Native Hadoop is a two-layer architecture composed of one master and many slaves. Each slave combines a DataNode and a TaskTracker, while the master manages the slave nodes. Because users can add slave nodes to a Hadoop cluster to speed up parallel computation on massive data, Hadoop has been viewed as a key technology for massive data processing. Although adding nodes improves performance, a Hadoop cluster composed of a large number of slaves occupies more physical space and consumes more energy. This study therefore combines ARM and x86 processors to form a new “Heterogeneous Three-Layer” Hadoop architecture, taking advantage of ARM's low energy consumption, high performance in massive data processing, and small footprint. In addition, the concept of a “Dynamically Managing Block Algorithm” is introduced into the task scheduler. This design not only remedies shortcomings of native Hadoop but also reduces Map/Reduce operation time by more than 22%.
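The abstract does not say how file blocks are assigned to nodes, so the following is only an illustrative sketch of one common approach for heterogeneous clusters: splitting a fixed number of blocks among nodes in proportion to an assumed relative processing speed. The function name `allocate_blocks`, the node names, and the speed weights are hypothetical and are not taken from the thesis.

```python
def allocate_blocks(total_blocks, node_speeds):
    """Split total_blocks among nodes in proportion to their relative speed.

    Uses the largest-remainder method so that the shares are integers that
    sum exactly to total_blocks. node_speeds maps node name -> positive weight.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not node_speeds or any(s <= 0 for s in node_speeds.values()):
        raise ValueError("node_speeds must be non-empty with positive weights")

    total_speed = sum(node_speeds.values())
    # Ideal (fractional) share for each node.
    ideal = {n: total_blocks * s / total_speed for n, s in node_speeds.items()}
    # Start from the floor of each ideal share.
    shares = {n: int(v) for n, v in ideal.items()}
    leftover = total_blocks - sum(shares.values())
    # Give the remaining blocks to the nodes with the largest fractional parts.
    by_remainder = sorted(ideal, key=lambda n: ideal[n] - shares[n], reverse=True)
    for n in by_remainder[:leftover]:
        shares[n] += 1
    return shares


# Hypothetical cluster: one x86 node ASSUMED to be 4x as fast as each ARM node.
example = allocate_blocks(
    16, {"x86-0": 4.0, "arm-0": 1.0, "arm-1": 1.0, "arm-2": 1.0, "arm-3": 1.0}
)
```

With these assumed weights the x86 node receives half of the blocks and each ARM node receives an equal part of the rest. A dynamic variant could recompute the weights from observed task completion times before each allocation, though whether the thesis does so is not stated in the abstract.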

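Acknowledgments
Chinese Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1 Research Background and Motivation
2 Thesis Organization
Chapter 2 Technologies and Theoretical Background
1 Cloud Computing
1.1 Infrastructure as a Service (IaaS)
1.2 Software as a Service (SaaS)
1.3 Platform as a Service (PaaS)
2 Hadoop
2.1 HDFS
2.2 MapReduce
3 Analysis of HDFS File Reads
3.1 Creating the FSDataInputStream Object
3.2 Reading a File with FSDataInputStream
4 Analysis of HDFS File Writes
4.1 Creating the FSDataOutputStream Object
4.2 Writing a File with FSDataOutputStream
Chapter 3 Literature Review and Research Method
1 Literature Review
2 Heterogeneous Hadoop Architecture
2.1 Replacing One x86 Processor with Four ARM Processors
2.2 Three-Layer Architecture
3 HDFS Source Code Modifications
4 Dynamic Task Allocation in MapReduce
5 Dynamic File Block Allocation Algorithm
6 MapReduce Source Code Modifications
Chapter 4 Experiments
1 Hardware Environment and Hadoop Parameter Settings
2 MapReduce Program Example
3 Input Data Format for WordCount.java
4 Experimental Results
4.1 Native Architecture versus Three-Layer Architecture
4.2 Centralized versus Distributed HDFS Storage
4.3 New Scheduling Algorithm
Chapter 5 Conclusions and Future Work
References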

                                

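Full-text release date: 2018/07/22 (campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)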
