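Graduate student: 牛仁正
Mohammad - Riza Nurtam
Thesis title: 在Hadoop平台下使用資料分群方法分析台灣農作物之價量資訊
Data Clustering on Taiwan Crop Sales under Hadoop Platform
Advisor: 楊朝龍
Chao-Lung Yang
Oral defense committee: 歐陽超
Chao Ou-Yang
Ren-Jieh Kuo
Degree: Master's
Department: School of Management - Department of Industrial Management
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 42
Chinese keywords: 巨量資料 (big data), 資料分群 (data clustering), Hadoop, 農作物價量分析 (crop price and volume analysis)
English keywords: Big data, Hadoop, Mahout, Clustering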
Hadoop is one of the most promising cloud computing platforms for Big Data analytics, the process of discovering hidden patterns, unknown correlations, and other valuable information in extremely large distributed datasets. In this thesis, a data clustering analysis was implemented on the Hadoop platform to study a large crop sales dataset collected at distributed sites in Taiwan. A Hadoop infrastructure was established to provide access to the distributed data centers. An online clustering algorithm built on Mahout, a scalable machine learning library, was used to analyze crop price and yield data from the distributed datasets. Such a clustering analysis is usually exhausting and time consuming when a single machine handles the whole process. Therefore, in this research, the clustering jobs were run in an experimental distributed Hadoop environment. The experimental results show that price and sales volume can be grouped into a few clusters. The results can support decision making in crop planning by forecasting or detecting demand changes in the market as early as possible.
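
To illustrate the kind of grouping the abstract describes, the following is a minimal single-machine sketch of k-means clustering on (price, sales volume) pairs. It is not the thesis's Mahout/Hadoop pipeline: the function name `kmeans`, the variable `records`, and all data values are hypothetical and chosen only for illustration, and real price and volume features would likely need scaling before clustering.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Hypothetical (price, sales volume) records, made up for this sketch:
# three low-price/high-volume days and three high-price/low-volume days.
records = [
    (20.0, 900.0), (22.0, 950.0), (21.0, 1000.0),
    (60.0, 200.0), (62.0, 250.0), (58.0, 180.0),
]

centroids, labels = kmeans(records, k=2)
for record, label in zip(records, labels):
    print(record, "-> cluster", label)
```

On a dataset of the size the abstract describes, the assignment and update steps are the parts that a distributed implementation would spread across machines, which is the motivation given for running the jobs on Hadoop.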

Chinese abstract (摘要)
ABSTRACT
Acknowledgement
CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Scope and limitation
  1.3 Objective
CHAPTER 2 LITERATURE REVIEW
  2.1 Big Data
  2.2 Hadoop
    2.2.1 MapReduce
    2.2.2 Hadoop Distributed File System
  2.3 Mahout
    2.3.1 Algorithms in Mahout
    2.3.2 Kmeans
  2.4 Decision Tree
CHAPTER 3 RESEARCH METHODOLOGY
  3.1 Data Collection
    3.1.1 Failure Handling
  3.2 Hadoop Cluster and Mahout Setup
  3.3 Experiments
CHAPTER 4 EXPERIMENT AND RESULTS
  4.1 Data preparation
  4.2 Preliminary data analysis
  4.3 Clustering result
    4.3.1 Dragon fruit data clustering
  4.4 Decision tree analysis
CHAPTER 5 CONCLUSIONS AND DISCUSSION
  5.1 Conclusions
  5.2 Discussion
  5.3 Future Research
REFERENCES
APPENDIX
  A. Hadoop node configuration: core-site.xml
  B. Hadoop node configuration: mapred-site.xml
  C. Hadoop node configuration: hdfs-site.xml
  D. Mahout commands
  E. Linux .bashrc configuration: Hadoop and Mahout environment settings
  F. /etc/hosts configuration
  G. Multivariate linear model of tomato data clustering
  H. Multivariate linear model of dragon fruit data clustering
  I. Persimmon tomato decision tree
  J. Dragon fruit decision tree

