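Graduate student: 牛仁正
Mohammad - Riza Nurtam
Thesis title: 在Hadoop平台下使用資料分群方法分析台灣農作物之價量資訊
Data Clustering on Taiwan Crop Sales under Hadoop Platform
Advisor: 楊朝龍
Chao-Lung Yang
Oral examination committee: 歐陽超
Chao Ou-Yang
郭人介
Ren-Jieh Kuo
Degree: Master
Department: College of Management - Department of Industrial Management
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 42
Keywords (translated from Chinese): Big data, Data clustering, Hadoop, Crop price and volume analysis
Keywords (English): Big data, Hadoop, Mahout, Clustering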
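  • The Hadoop parallel computing platform is a cloud-based parallel computing platform capable of analyzing big data, and many cloud applications have already been built on it. This study obtained data on Taiwan's agricultural products from the open data published by the Council of Agriculture, built a data analysis environment on the Hadoop platform, and used Mahout, the data mining module of the Hadoop platform, to analyze the price and volume of Taiwan's agricultural products. In this study, the data clustering method K-means was applied to the price and volume data of specific agricultural products. Combined with decision tree analysis, the results show that crop price and volume information contains distinct cluster characteristics, and that price and volume can be predicted from the trading week, the market, and the climate. The results can serve as a reference for deciding when to plant crops or for detecting changes in market demand.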


    Hadoop is one of the most promising cloud computing platforms for Big Data analytics, the process of discovering hidden patterns, unknown correlations, and other valuable information in extremely large distributed datasets. In this thesis, data clustering was implemented on the Hadoop platform to study a large crop sales dataset collected from distributed sources in Taiwan. A Hadoop infrastructure was established to provide access to the distributed data centers. An online clustering algorithm using Mahout, a scalable machine learning library, was applied to analyze crop price and yield data from the distributed datasets. Such clustering analysis is usually exhausting and time-consuming when a single machine handles the whole process. Therefore, in this research, the clustering jobs were run in an experimental distributed Hadoop environment. The experimental results show that price and sales volume can be grouped into a few clusters. The results can support crop planning decisions by forecasting or detecting changes in market demand as early as possible.

    Abstract (Chinese)
    ABSTRACT
    Acknowledgement
    CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    CHAPTER 1 INTRODUCTION
        1.1 Background
        1.2 Scope and limitation
        1.3 Objective
    CHAPTER 2 LITERATURE REVIEW
        2.1 Big Data
        2.2 Hadoop
            2.2.1 MapReduce
            2.2.2 Hadoop Distributed File System
        2.3 Mahout
            2.3.1 Algorithms in Mahout
            2.3.2 Kmeans
        2.4 Decision Tree
    CHAPTER 3 RESEARCH METHODOLOGY
        3.1 Data Collection
            3.1.1 Failure Handling
        3.2 Hadoop Cluster and Mahout Setup
        3.3 Experiments
    CHAPTER 4 EXPERIMENT AND RESULTS
        4.1 Data preparation
        4.2 Preliminary data analysis
        4.3 Clustering result
            4.3.1 Dragon fruit data clustering
        4.4 Decision tree analysis
    CHAPTER 5 CONCLUSIONS AND DISCUSSION
        5.1 Conclusions
        5.2 Discussion
        5.3 Future Research
    REFERENCES
    APPENDIX
        A. Hadoop node configuration: core-site.xml
        B. Hadoop node configuration: mapred-site.xml
        C. Hadoop node configuration: hdfs-site.xml
        D. Mahout commands
        E. Linux .bashrc configuration Hadoop and Mahout environment settings
        F. /etc/hosts configuration
        G. Multivariate linear model of tomato data clustering
        H. Multivariate linear model of dragon fruit data clustering
        I. Persimmon tomato decision tree
        J. Dragon fruit decision tree

