Graduate student: | 楊明憲 Ming-hsien Yang |
---|---|
Thesis title: | 高效能異質性Hadoop架構 High-Performance Heterogeneous Hadoop Architecture |
Advisor: | 徐勝均 Sendren Sheng-Dong Xu |
Oral defense committee: | 陳佳堃 Jia-kun Chen, 李俊賢 Jin-shyan Lee |
Degree: | Master |
Department: | College of Engineering - Graduate Institute of Automation and Control |
Year of publication: | 2013 |
Academic year of graduation: | 101 (ROC calendar) |
Language: | Chinese |
Number of pages: | 93 |
Keywords (Chinese): | Hadoop, HDFS, MapReduce |
Keywords (English): | Hadoop, HDFS, MapReduce |
Native Hadoop uses a two-layered architecture consisting of a master and many slaves. The master manages the slaves, and each slave combines the roles of a data node (DataNode) and a compute node (TaskTracker). Because users can add slave nodes to improve the performance of parallel computation over massive data, Hadoop has long been regarded as a key technology for massive data processing. However, a Hadoop cluster built from a large number of slaves has a larger physical footprint and consumes more energy. This thesis therefore combines ARM nodes, chosen for their low power consumption, high performance in massive data processing, and small size, with x86 nodes to form a new “Heterogeneously Three-Layered” Hadoop architecture. In addition, the concept of a “Dynamically Managing Block Algorithm” (dynamic allocation of file blocks) is introduced into task scheduling. The design not only addresses the shortcomings of the native architecture but also reduces Map/Reduce computation time by more than 22%.
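The abstract does not describe how file blocks are assigned to nodes, so the following is only a minimal sketch of one common idea behind dynamic block allocation in a heterogeneous cluster: give each node a share of the blocks proportional to its relative processing speed. The function names `allocate_blocks` and `estimated_makespan`, the node labels, and the speed figures are all hypothetical and are not taken from the thesis.

```python
from typing import Dict


def allocate_blocks(total_blocks: int, node_speeds: Dict[str, float]) -> Dict[str, int]:
    """Split total_blocks across nodes in proportion to relative speed.

    Illustrative assumption: node_speeds maps a node name to the number of
    blocks it can process per unit time. The largest-remainder method is
    used so that the integer counts sum exactly to total_blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not node_speeds or any(s <= 0 for s in node_speeds.values()):
        raise ValueError("node_speeds must be non-empty with positive speeds")

    total_speed = sum(node_speeds.values())
    # Ideal (fractional) share for each node.
    exact = {n: total_blocks * s / total_speed for n, s in node_speeds.items()}
    # Round every share down first.
    counts = {n: int(q) for n, q in exact.items()}
    leftover = total_blocks - sum(counts.values())
    # Hand the remaining blocks to the nodes with the largest fractional part.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


def estimated_makespan(counts: Dict[str, int], node_speeds: Dict[str, float]) -> float:
    """Time until the slowest-finishing node is done, ignoring transfer costs."""
    return max(counts[n] / node_speeds[n] for n in counts)


if __name__ == "__main__":
    # Hypothetical cluster: one x86 node assumed three times as fast as each ARM node.
    speeds = {"x86-1": 3.0, "arm-1": 1.0, "arm-2": 1.0}
    plan = allocate_blocks(100, speeds)
    print(plan)
    print(estimated_makespan(plan, speeds))
```

Under these made-up speeds, a proportional split lets all three nodes finish at about the same time, whereas an even split would leave the faster node idle while the slower nodes finish. A real scheduler would also have to account for data locality, replication, and network transfer, which this sketch ignores.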