
Graduate student: 陳彥均
Yan-Jun Chen
Thesis title: 整合Hadoop與Spark數據分析平台之建構與應用-基於政府公開資料之空氣品質預測為例
Development and application of a data analysis platform based on the integration of Hadoop and Spark–a case study of air quality forecast based on government Open Data
Advisor: 陳鴻銘
Hung-Ming Chen
Oral defense committee: 林祐正
Yu-Cheng Lin
謝佑明
Yo-Ming Hsieh
陳鴻銘
Hung-Ming Chen
Degree: Master
Department: College of Engineering - Department of Civil and Construction Engineering
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar, 2017-2018)
Language: Chinese
Number of pages: 125
Chinese keywords: 雲端運算, 巨量資料分析, 時間序列, 隨機森林樹, Spark, 資料探勘
English keywords: Cloud Computing, Big Data Analysis, Data Mining, Time Series Analysis, Random Forest, Spark
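  • The topic of big data has continued to develop in recent years as the costs of acquiring monitoring data and of storing data fall year by year. Data storage and data analysis technologies for cloud platforms are therefore bound to become important areas of future development. This study integrates the existing open source software Hadoop and Spark to build a cloud platform that provides data storage and data analysis. The case study uses government open data, specifically the air quality monitoring data of the Environmental Protection Administration, both to put government data to active use and to test the platform's performance and flexibility in data analysis.
    Two analysis modes are used. The first is time series analysis, which builds a model from the data's own history and forecasts future values. The second is a numerical prediction mode that combines time series analysis with random forest regression trees: time series analysis first produces forecasts of the feature values, and these forecasts are then fed into the model built by the random forest to make the prediction. In this mode, increasing the amount of effective data increases the accuracy of the model and improves the predicted values. Finally, the study examines the analysis performance of the platform by comparing the model training time of different versions of the random forest model on the same data, showing that the cloud platform effectively reduces model training time.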


    In recent years, interest in big data has continued to grow as the costs of data acquisition and data storage decrease year by year. Data storage and data analysis technologies for cloud platforms are bound to become important areas of future development. Based on the existing open source software Hadoop and Spark, this study integrates and builds a cloud platform that provides data storage and data analysis services. The case study uses government open data, namely the air quality monitoring data from the Environmental Protection Administration, in order to put government data to active use. Based on the case study, the effectiveness and usability of the cloud platform in data analysis can be tested and verified.
    Two analysis models are used for data analysis in the case study. The first is time series analysis, which builds a model from historical information to forecast future data values. The second is a prediction model that integrates time series analysis with the random forest regression tree. Time series analysis is first used to forecast the feature values, and the random forest is then used to predict the target value from the forecast feature values. With this prediction model, the accuracy of the prediction can be improved by increasing the amount of effective data. Finally, this study examines the performance of the analysis platform by comparing the model training time of different versions of the random forest model on the same data. The results show that the cloud platform developed in this research can effectively reduce model training time.
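    The two-stage prediction model described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the abstract does not name the time series model, so a first-order autoregressive model, AR(1), stands in for it; a small pure-Python random forest stands in for the Spark MLlib regressor so that the sketch runs without a cluster; and the function names, the feature names `wind_speed` and `pm10`, and all numbers are hypothetical.

```python
import random
from statistics import mean


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, nxt = series[:-1], series[1:]
    m_prev, m_nxt = mean(prev), mean(nxt)
    var = sum((p - m_prev) ** 2 for p in prev)
    cov = sum((p - m_prev) * (n - m_nxt) for p, n in zip(prev, nxt))
    phi = cov / var if var else 0.0
    return m_nxt - phi * m_prev, phi


def forecast_next(series):
    """One-step-ahead forecast from the fitted AR(1) model (an assumption)."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, rng, max_features, depth=0, max_depth=5):
    """Grow one regression tree; a leaf is a number, a node is a tuple."""
    if depth >= max_depth or len(set(y)) == 1:
        return mean(y)
    best = None
    # Random feature subset at each split, as in a random forest.
    for f in rng.sample(range(len(X[0])), max_features):
        for t in sorted({row[f] for row in X})[1:]:
            left = [i for i, row in enumerate(X) if row[f] < t]
            right = [i for i, row in enumerate(X) if row[f] >= t]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # the sampled features had no two distinct values
        return mean(y)
    _, f, t, left, right = best

    def grow(idx):
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          rng, max_features, depth + 1, max_depth)

    return (f, t, grow(left), grow(right))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node


def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train each tree on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, max_features))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


def two_stage_forecast(feature_history, target_history, seed=0):
    """Stage 1: forecast each feature. Stage 2: feed forecasts to the forest."""
    names = sorted(feature_history)
    X = [[feature_history[n][t] for n in names]
         for t in range(len(target_history))]
    forest = fit_forest(X, target_history, seed=seed)
    next_features = [forecast_next(feature_history[n]) for n in names]
    return predict_forest(forest, next_features)


# Made-up hourly readings, for illustration only.
features = {
    "wind_speed": [3.1, 2.8, 2.5, 2.9, 3.4, 3.0, 2.6, 2.4],
    "pm10": [40.0, 46.0, 52.0, 47.0, 38.0, 43.0, 50.0, 55.0],
}
pm25 = [18.0, 22.0, 27.0, 23.0, 16.0, 20.0, 25.0, 29.0]
print(round(two_stage_forecast(features, pm25), 2))
```

    Because every leaf is the mean of training targets, the forest can only return values inside the range of the observed target. On a real platform the second stage would more plausibly call a distributed regressor, with the first stage supplying its input row.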

    Table of contents of the thesis:
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Motivation
      1.2 Research Objectives
      1.3 Research Scope
      1.4 Research Methods
      1.5 Thesis Structure
    Chapter 2 Research Background
      2.1 Literature Review
        2.1.1 Cloud Computing Platforms
        2.1.2 Government Open Data
        2.1.3 Big Data Analysis
        2.1.4 Data Mining
      2.2 System Development Technologies
        2.2.1 Apache Hadoop
        2.2.2 Apache Hadoop YARN
        2.2.3 Spark MLlib
      2.3 System Development Tools
        2.3.1 Python
        2.3.2 HTML
        2.3.3 PHP
    Chapter 3 System Architecture and Operating Mechanism
      3.1 System Architecture
      3.2 Data Upload Mode
      3.3 Data Preprocessing
      3.4 Analysis Modes of the Algorithms
        3.4.1 Time Series Algorithm
        3.4.2 Random Forest Regression Tree in Spark MLlib
        3.4.3 Prediction with the Time Series Algorithm Integrated with the Random Forest Regression Tree Model
      3.5 Big Data Analysis
        3.5.1 Data Mining Operating Mechanism
        3.5.2 Validation of Data Analysis Results
    Chapter 4 Big Data Analysis Application Example
      4.1 Analysis Data
        4.1.1 Data Source
        4.1.2 Scope of Data Collection
      4.2 Data Processing and Workflow
        4.2.1 Data Preprocessing
        4.2.2 Time Series Forecasting
        4.2.3 Random Forest Decision Tree
      4.3 Model Building Process
        4.3.1 Time Series Method
          4.3.1.1 Parameter Settings of the Method
          4.3.1.2 Presentation of the Method's Results
        4.3.2 Random Forest Regression Tree Combined with Time Series Feature Value Prediction
          4.3.2.1 Parameter Settings of the Random Forest Regression Tree
          4.3.2.2 Presentation of the Random Forest Model Results
          4.3.2.3 Building and Importing the Time Series Feature Values
      4.4 Analysis Results
        4.4.1 PM2.5 Prediction (Time Series Method)
        4.4.2 Comparison of Prediction Results for the Random Forest Regression Tree Combined with Time Series Feature Values
        4.4.3 Comparison of Platform Prediction Performance
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Work
    References
    Appendix


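    Full-text release date: 2023/08/28 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)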