研究生: |
黃楷鈞 Kai-Chun Huang |
---|---|
論文名稱: |
於 Hadoop Yarn 平台上建置平行化的 FP-Growth 演算法 An Implementation of the parallelized FP-Growth Algorithm in the Hadoop Yarn Platform |
指導教授: |
呂永和
Yungho Leu |
口試委員: |
楊維寧
Wei-Ning Yang 陳雲岫 Yun-Shiow Chen |
學位類別: |
碩士 Master |
系所名稱: |
管理學院 - 資訊管理系 Department of Information Management |
論文出版年: | 2016 |
畢業學年度: | 104 |
語文別: | 中文 |
論文頁數: | 58 |
中文關鍵詞: | 巨量資料集 、高頻項目集探勘 、平行化架構 、Hadoop |
外文關鍵詞: | Big Datasets, Frequent Itemset Mining, Parallel Architecture, Hadoop |
相關次數: | 點閱:204 下載:4 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
當今網路與資訊爆炸的時代,資料透過網路產生的速度更是呈指數性的成長,
將資料分析成為有價值的資訊是現今大家所關注的議題之一。Hadoop 為當今資
料探勘的熱門平台。Hadoop 的優點是採用分散式的檔案管理,能將大資料集切
割成資料片段,動態地分配給不同的節點。搭配 Hadoop 的 MapReduce 的程式
寫法,將一個工作切為數個子工作,交付不同的節點處理,以加速分析處理問題
的效能;因此,Hadoop 平台適合處理巨量資料集。 FP-Growth 演算法為不產生
候選項目集的高頻項目集探勘演算法,在許多領域中具有很高的實際應用價值。
然而傳統的 FP-Growth 只能處理小資料集,在面對巨量資料集時,處理的時間與
效能較不理想。本研究針對對傳統的 FP-Growth 演算法,利用 MapReduce 運算
技術,將 FP-Growth 演算法平行化於 Hadoop Yarn 平台上。實驗結果發現新的
Yarn 架構,在處理不同資料集時,比舊的 Hadoop 架構具有一定的優勢。
Due to rapid advance of the Internet, data have been generated in an extremely fast speed. In an era of data explosion, people are eager to extract valuable information from large amount of data. Hadoop is a very popular framework for data mining on big datasets. In a Hadoop platform, one can splits a computing task into a set of subtasks to be processed by different computing nodes with the Map and Reduce operations. Furthermore, the Hadoop distributed file system, HDFS, is able to split a big dataset into file segments and to allocate them among different computing nodes of the platform. Therefore, Hadoop computing platform is well suited for data mining on big datasets. The FP-Growth is a famous frequent itemset mining algorithm with many realife applications. However, the traditional FP-Growth algorithm can only deal with small datasets. It is inadequate in handling big datasets. In thesis, we implemented a parallel FP-Growth algorithm in the new Hadoop Yarn platform. The experimental results showed that the new platform offered better performance on the FP-Growth algorithm than an older Hadoop platform did on many different datasets.
[1] Usama Fayyad, Gregory Piatetsky-Shapiro (1996 ). From Data Mining to Knowledge Discovery in Databases. Association for the Advancement of Artificial Intelligence. Pages 37-54.
[2] Chih-Jen-Lin (2014). Big-data Analytics: Challenges and Opportunities, Department of Computer Science. National Taiwan University, August 30, 2014.
[3] Hadoop. Retrieved from http://hadoop.apache.org/.
[4] Apache Official Website. Retrieved from https://httpd.apache.org/.
[5] Rakesh Agrawal (1993). SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data. Pages 207-216.
[6] Jiawie Han (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data, Pages 1-12.
[7] Rakesh Agrawal (1994). Fast Algorithms for Mining Association Rules in Large Databases. VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases, Pages 487-499.
[8] Yanfei Guo (2013). iShuffle: Improving Hadoop Performance with Shuffle-on-Write. 10th International Conference on Autonomic Computing (ICAC ’13), Pages 107-117.
[9] Big Data. Retrieved from http:// zh.wikipedia.org/wiki/bigdata.
[10] Den Laney (2001). 3D data management: controlling data volume, velocity and variety. Application Delivery Strategies. META Group, (February 2001).
[11] Kevin Normandeau, Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity. Big data innovation summit 2013, Boston, Sept. 12, 2013.
[12] Paolo Giudici (2009). Market Basket Analysis. Applied Data Mining for Business and Industry, Second Edition.
[13] QIAN Guang-chao (2008). One Optimized Method of Apriori Algorithm. Electronic Technology & Information Science, Shanghai, 2008.
[14] Grid Computing. Retrieved from http://zh.wikipedia.org/wiki/gridcomputing.
[15] Hadoop Ecosystem. Retrieved from http://www.cloudera.com/training/library/hadoop-ecosystem.html.
[16] Hadoop Yarn. Retrieved from http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[17] Haoyuan Li (2007). PFP: Parallel FP-Growth for Query Recommendation. Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25.
[18] Le Zhou (2010). Balanced Parallel FP-Growth with MapReduce. Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference, Page(s):243 – 246.
[19] Xiaoting Wei, Yunlong Ma (2014). Incremental FP-Growth Mining Strategy for DynamicThreshold Value and Database Based on MapReduce. Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design, Page(s):271 – 276.