
Author: 詹佳芸 (Chia-Yun Chan)
Thesis title: 在雲端平台上建立機器學習機制用於資料分析與預測 (Building Machine Learning Mechanism on Cloud Platform for Data Analysis and Prediction)
Advisor: 呂政修 (Jenq-Shiou Leu)
Oral defense committee: 石維寬 (Wei-Kuan Shih), 孫敏德 (Min-Te Sun), 陳維美 (Wei-Mei Chen), 鄭欣明 (Shin-Ming Cheng)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2016
Graduation academic year: 104 (ROC calendar)
Language: Chinese
Pages: 36
Keywords: Machine Learning, Linear Regression, Decision Tree, Random Forest, Apache Spark, Cloud Computing
    Data analysis and prediction are receiving growing attention across industries, and sales forecasting systems are an important part of enterprise business strategy and sales development. In this thesis, we apply statistical and visualization analysis to each feature of a retailer's actual sales data and select the feature set that most influences the training of the prediction model. We build sales prediction models with three machine learning techniques, Linear Regression, Decision Tree, and Random Forest, and repeatedly adjust the feature set to improve their accuracy. Our experiments show that the Random Forest model is the most accurate of the three. After refining the feature set, the MAPE (Mean Absolute Percentage Error) falls from 0.32740 to 0.28912, an 11.7% reduction in error, and the RMSPE (Root Mean Square Percentage Error) falls from 0.51074 to 0.41805, an 18.1% reduction.

    The contribution of this thesis is a highly accurate approach to sales data analysis and prediction, together with an implementation of a Spark-based computing environment that carries out the proposed sales analysis and prediction on a cloud platform.
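The two error metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis's own code: the abstract does not give the formulas, so the usual textbook definitions are assumed, the choice to skip records whose actual value is zero is an assumption, and the function names and toy numbers are hypothetical. The last part only re-derives the reported 11.7% and 18.1% reductions from the four figures quoted in the abstract.

```python
import math


def _nonzero_pairs(actual, predicted):
    # Assumption: records with an actual value of 0 are skipped, because the
    # percentage error is undefined for them.
    return [(a, p) for a, p in zip(actual, predicted) if a != 0]


def mape(actual, predicted):
    """Mean Absolute Percentage Error as a fraction (0.1 means 10%)."""
    pairs = _nonzero_pairs(actual, predicted)
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def rmspe(actual, predicted):
    """Root Mean Square Percentage Error as a fraction."""
    pairs = _nonzero_pairs(actual, predicted)
    return math.sqrt(sum(((a - p) / a) ** 2 for a, p in pairs) / len(pairs))


def relative_reduction(before, after):
    """Fraction by which an error value dropped relative to its old value."""
    return (before - after) / before


# Hypothetical toy data, not taken from the thesis.
toy_actual = [100.0, 200.0, 400.0]
toy_predicted = [110.0, 180.0, 400.0]
toy_mape = mape(toy_actual, toy_predicted)
toy_rmspe = rmspe(toy_actual, toy_predicted)

# Figures quoted in the abstract, before and after refining the feature set.
mape_drop_pct = round(relative_reduction(0.32740, 0.28912) * 100, 1)
rmspe_drop_pct = round(relative_reduction(0.51074, 0.41805) * 100, 1)
print(mape_drop_pct, rmspe_drop_pct)
```

Because RMSPE squares each percentage error before averaging, a few large misses weigh more heavily on it than on MAPE, which is one plausible reason the two metrics improve by different amounts.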

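    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Preface
        1.2 Research Motivation and Objectives
        1.3 Thesis Organization
    Chapter 2 Background and Related Technologies
        2.1 Related Work
        2.2 Apache Hadoop
        2.3 Apache Spark
            2.3.1 Spark Core Components
            2.3.2 RDD
        2.4 Machine Learning Algorithms
            2.4.1 Linear Regression
            2.4.2 Decision Tree
            2.4.3 Random Forests
    Chapter 3 System Architecture and Data Model Construction
        3.1 System Architecture and Workflow
        3.2 Dataset Description and Analysis
        3.3 Model Building
    Chapter 4 Performance Evaluation
        4.1 System Environment
        4.2 Evaluation Method
        4.3 Evaluation Results
    Chapter 5 Conclusions and Future Work
    References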

