
Graduate student: Dereje Tekilu Aseffa
Thesis title: A Virtualization-based Hybrid Storage System for A Map-Reduce Framework
Advisor: Chin-Hsien Wu (吳晉賢)
Oral defense committee: Tei-Wei Kuo
Shanq-Jang Ruan
Chia-Lin Yang
Wei-Mei Chen
Jenq-Shiou Leu
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2016
Academic year of graduation: 104
Language: English
Number of pages: 69
Keywords: Map-Reduce, Hybrid Storage Systems, Solid-State Drives


The map-reduce framework is widely used for big data analysis. In a typical map-reduce framework, both the master node and the worker nodes can use hard-disk drives (HDDs) as local disks for the map-reduce computation. However, because of the inherent mechanical limitations of HDDs, I/O performance becomes a bottleneck for the map-reduce framework when I/O-intensive applications (e.g., sorting) are run. Solid-state drives (SSDs) perform better than HDDs, but replacing HDDs with SSDs outright is not economical. In this dissertation, we propose a virtualization-based hybrid storage system for the map-reduce framework. The objective is to combine the fast access of SSDs with the low cost of HDDs, realizing an economical design that improves the I/O performance of a map-reduce framework in a virtualization environment. We propose three storage combinations: SSD-based, HDD-based, and a hybrid of SSD-based and HDD-based storage systems that balances speed, capacity, and lifetime. According to our experiments, the hybrid of SSD-based and HDD-based storage systems offers superior performance and economy.

List of Figures
List of Tables
1 Introduction
1.0.1 Organization
2 Background Knowledge
2.1 Big Data
2.1.1 Big Data Characteristics
2.1.2 Big Data Application Domains
2.1.3 Big Data Analysis tools
2.2 Map-Reduce Framework
2.3 SSD Vs HDD
2.4 Virtualization
2.4.1 Types of Virtualization
3 Literature Review
4 A Virtualization-based Hybrid Storage System for a Map-Reduce Framework
4.1 Motivation
4.2 Proposed System
4.2.1 Overview
4.2.2 Map Phase
4.2.3 Reduce Phase
4.3 Performance Evaluation
4.3.1 Experimental Setup and Metrics
4.3.2 Experimental Results and Discussion
5 Conclusion and Future Work
Bibliography

