
Student: Chi Yen
Thesis Title: Combination of single-objective optimization and random forest for Apache Spark global parameters automatic determination
Advisors: Kai-Lung Hua, Chao-Lung Yang
Committee Members: Kai-Lung Hua, Chao-Lung Yang, Shan-Hsiang Shen
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2023
Academic Year of Graduation: 111
Language: English
Pages: 43
Chinese Keywords (translated): Apache Spark, random forest, genetic algorithm, projective sampling, hybrid sampling, automatic global parameter configuration
English Keywords: Spark, Random forest, Genetic algorithm, Projective sampling, Hybrid sampling, Automatic configuration for global parameters
Access counts: Views: 163; Downloads: 0
  • Apache Spark is an open-source distributed computing system designed for processing large-scale datasets and executing high-performance data-processing tasks. Spark provides an easy-to-use development framework and keeps data in memory to deliver fast processing speeds. One of Spark's characteristics is its large number of parameters, which are critical for tuning system performance. Many previous studies have pointed out that Spark's parameter configuration strongly affects workload performance, but with as many as 150 parameters in total, finding the optimal configuration is a major challenge. This thesis proposes a Spark global-parameter auto-tuning method named Global Automatic configuration method for Spark (GACS), which enables Spark to balance performance across different types of workloads and thereby obtain parameters suitable for all of them. We use 5 and 8 workloads of different types as development and validation benchmarks, respectively, and implement the system in a Google Cloud Platform cloud environment.
    Experimental results show that, compared with GCP_Default, GACS achieves average speedups of 1.12x and 1.36x on the D3-size development and validation benchmarks, respectively, and GACS outperforms GCP_Default on the vast majority of workloads. We also observe that GACS's performance advantage tends to grow as the data volume increases, which is a very favorable trend for a big-data computing framework.


    Apache Spark is an open-source distributed computing system, designed specifically for handling large-scale datasets and executing high-performance data processing tasks. Spark offers an easy-to-use programming framework and keeps data in memory to provide fast data processing speed. One characteristic of Spark is its vast array of parameters, which are crucial for tuning system performance. Many previous studies have pointed out that Spark parameter configuration significantly affects Spark workload performance. However, finding the optimal configuration among so many parameters is a significant challenge, since the total number of Spark parameters is as high as 150. This paper proposes a method called Global Automatic configuration method for Spark (GACS), enabling Spark to balance performance across different types of workloads and thus achieve optimal parameters applicable to various workloads. In the experiments, 5 and 8 different types of workloads are used as development and validation benchmarks, respectively, and the system is implemented in a Google Cloud Platform environment. Experimental results show that, compared to GCP_Default, GACS achieves average speedups of 1.12x and 1.36x under the D3-size development and validation benchmarks, respectively. GACS outperforms GCP_Default in the majority of workloads. Additionally, GACS exhibits a trend of increasing performance as data volume increases, which is highly beneficial for a big data computing framework.
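The abstract describes GACS as pairing a learned performance model (a random forest, per the keywords) with a single-objective search over Spark parameters. A minimal sketch of that idea follows; it is not the thesis's actual implementation, and the parameter names, value ranges, and the analytic `predict_runtime` surrogate (standing in for a trained random-forest predictor) are illustrative assumptions only:

```python
import random

# Hypothetical GACS-style search loop: a genetic algorithm explores Spark
# parameter configurations, scored by a surrogate performance model.
# PARAM_SPACE and predict_runtime are made-up stand-ins for illustration.

PARAM_SPACE = {
    "spark.executor.memory_gb": (1, 16),
    "spark.executor.cores": (1, 8),
    "spark.sql.shuffle.partitions": (8, 400),
}

def predict_runtime(cfg):
    """Stand-in for a trained random-forest runtime predictor (lower is better)."""
    mem = cfg["spark.executor.memory_gb"]
    cores = cfg["spark.executor.cores"]
    parts = cfg["spark.sql.shuffle.partitions"]
    return 100 / (mem * cores) + abs(parts - 200) * 0.05

def random_config(rng):
    """Sample one configuration uniformly from the parameter space."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in PARAM_SPACE.items()}

def crossover(a, b, rng):
    """Uniform crossover: each parameter is inherited from either parent."""
    return {k: (a if rng.random() < 0.5 else b)[k] for k in PARAM_SPACE}

def mutate(cfg, rng, rate=0.2):
    """Re-sample each parameter with probability `rate`."""
    return {k: (rng.randint(*PARAM_SPACE[k]) if rng.random() < rate else v)
            for k, v in cfg.items()}

def ga_search(generations=30, pop_size=20, seed=0):
    """Evolve configurations toward lower predicted runtime."""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predict_runtime)          # rank by predicted runtime
        elite = pop[: pop_size // 2]           # keep the better half
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=predict_runtime)

best = ga_search()
```

Because every candidate is scored by the cheap surrogate rather than by running Spark jobs, the search can evaluate thousands of configurations; only the final candidates would need real cluster runs.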

    Abstract in Chinese  i
    Abstract in English  ii
    Acknowledgements  iii
    Contents  iv
    List of Figures  vi
    List of Tables  vii
    1 Introduction  1
    2 Related Works  5
    2.1 Spark Parameter Tuning via Trial-and-Error  5
    2.2 Configuring In-memory Cluster Computing using Random Forest  6
    2.3 Efficient Performance Prediction for Apache Spark  7
    3 Methodology  8
    3.1 Projective Sampling  8
    3.2 Hybrid Sampling  9
    3.3 Model Selection  11
    3.4 Genetic Algorithm  12
    4 Experimental Setup  16
    4.1 Cluster Configuration  16
    4.2 Benchmark of Development and Validation  16
    4.3 Spark Configuration Parameter  18
    4.4 Data Collection  19
    5 Experimental Results  21
    5.1 GACS Optimized Parameters  21
    5.2 Result of Developing Benchmark  23
    5.3 Result of Validating Benchmark  26
    6 Conclusions  29
    References  31


    Full text available from 2028/06/16 (campus network)
    Full text available from 2033/06/16 (off-campus network)
    Full text available from 2033/06/16 (National Central Library: Taiwan NDLTD system)