於 Hadoop Yarn 平台上建置平行化的 FP-Growth 演算法｜國立臺灣科技大學博碩士論文系統

簡易檢索 / 詳目顯示

回結果列表

研究生：	黃楷鈞 Kai-Chun Huang
論文名稱：	於 Hadoop Yarn 平台上建置平行化的 FP-Growth 演算法 An Implementation of the parallelized FP-Growth Algorithm in the Hadoop Yarn Platform
指導教授：	呂永和 Yungho Leu
口試委員:	楊維寧 Wei-Ning Yang 陳雲岫 Yun-Shiow Chen
學位類別：	碩士 Master
系所名稱：	管理學院 - 資訊管理系 Department of Information Management
論文出版年：	2016
畢業學年度：	104
語文別：	中文
論文頁數：	58
中文關鍵詞：	巨量資料集、高頻項目集探勘、平行化架構、Hadoop
外文關鍵詞：	Big Datasets, Frequent Itemset Mining, Parallel Architecture, Hadoop
相關次數：	點閱：204 下載：4
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

當今網路與資訊爆炸的時代，資料透過網路產生的速度更是呈指數性的成長，
將資料分析成為有價值的資訊是現今大家所關注的議題之一。Hadoop 為當今資
料探勘的熱門平台。Hadoop 的優點是採用分散式的檔案管理，能將大資料集切
割成資料片段，動態地分配給不同的節點。搭配 Hadoop 的 MapReduce 的程式
寫法，將一個工作切為數個子工作，交付不同的節點處理，以加速分析處理問題
的效能；因此，Hadoop 平台適合處理巨量資料集。 FP-Growth 演算法為不產生
候選項目集的高頻項目集探勘演算法，在許多領域中具有很高的實際應用價值。
然而傳統的 FP-Growth 只能處理小資料集,在面對巨量資料集時，處理的時間與
效能較不理想。本研究針對對傳統的 FP-Growth 演算法，利用 MapReduce 運算
技術,將 FP-Growth 演算法平行化於 Hadoop Yarn 平台上。實驗結果發現新的
Yarn 架構，在處理不同資料集時，比舊的 Hadoop 架構具有一定的優勢。

Due to rapid advance of the Internet, data have been generated in an extremely fast speed. In an era of data explosion, people are eager to extract valuable information from large amount of data. Hadoop is a very popular framework for data mining on big datasets. In a Hadoop platform, one can splits a computing task into a set of subtasks to be processed by different computing nodes with the Map and Reduce operations. Furthermore, the Hadoop distributed file system, HDFS, is able to split a big dataset into file segments and to allocate them among different computing nodes of the platform. Therefore, Hadoop computing platform is well suited for data mining on big datasets. The FP-Growth is a famous frequent itemset mining algorithm with many realife applications. However, the traditional FP-Growth algorithm can only deal with small datasets. It is inadequate in handling big datasets. In thesis, we implemented a parallel FP-Growth algorithm in the new Hadoop Yarn platform. The experimental results showed that the new platform offered better performance on the FP-Growth algorithm than an older Hadoop platform did on many different datasets.

摘要	I
ABSTRACT	II
目錄	III
圖目錄	V
表目錄	VI
第一章　緒論	1
1.1 研究背景	1
1.2 研究動機與目的	3
1.3 研究挑戰與限制	4
1.4 研究貢獻	5
1.5 研究架構	6
第二章　文獻探討	7
2.1 巨量資料	7
2.2 關聯規則	10
2.3 FP-Growth演算法	13
2.3.1 FP-Tree的建構方式	14
2.3.2 FP-Growth演算法	18
2.3.3 FP-Growth演算法的缺失	20
2.4平行化計算架構	21
2.4.1 Hadoop	22
2.4.2 MapReduce	25
第三章	研究方法	36
3.1 研究架構	36
3.2 研究方法	37
3.3 PFP演算法	38
第四章 方法實作與實驗結果	43
4.1 實驗環境建置	43
4.2 方法實作	44
4.3 PFP演算法於資料頻繁項目集之比較	46
第五章  結論	56
5.1 結論	56
參考文獻	57

                                

[1] Usama Fayyad, Gregory Piatetsky-Shapiro (1996 ). From Data Mining to Knowledge Discovery in Databases. Association for the Advancement of Artificial Intelligence. Pages 37-54.
[2] Chih-Jen-Lin (2014). Big-data Analytics: Challenges and Opportunities, Department of Computer Science. National Taiwan University, August 30, 2014.
[3] Hadoop. Retrieved from http://hadoop.apache.org/.
[4] Apache Official Website. Retrieved from https://httpd.apache.org/.
[5] Rakesh Agrawal (1993). SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data. Pages 207-216.
[6] Jiawie Han (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data, Pages 1-12.
[7] Rakesh Agrawal (1994). Fast Algorithms for Mining Association Rules in Large Databases. VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases, Pages 487-499.
[8] Yanfei Guo (2013). iShuffle: Improving Hadoop Performance with Shuffle-on-Write. 10th International Conference on Autonomic Computing (ICAC ’13), Pages 107-117.
[9] Big Data. Retrieved from http:// zh.wikipedia.org/wiki/bigdata.
[10] Den Laney (2001). 3D data management: controlling data volume, velocity and variety. Application Delivery Strategies. META Group, (February 2001).
[11] Kevin Normandeau, Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity. Big data innovation summit 2013, Boston, Sept. 12, 2013.
[12] Paolo Giudici (2009). Market Basket Analysis. Applied Data Mining for Business and Industry, Second Edition.
[13] QIAN Guang-chao (2008). One Optimized Method of Apriori Algorithm. Electronic Technology & Information Science, Shanghai, 2008.
[14] Grid Computing. Retrieved from http://zh.wikipedia.org/wiki/gridcomputing.
[15] Hadoop Ecosystem. Retrieved from http://www.cloudera.com/training/library/hadoop-ecosystem.html.
[16] Hadoop Yarn. Retrieved from http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[17] Haoyuan Li (2007). PFP: Parallel FP-Growth for Query Recommendation. Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25.
[18] Le Zhou (2010). Balanced Parallel FP-Growth with MapReduce. Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference, Page(s):243 – 246.
[19] Xiaoting Wei, Yunlong Ma (2014). Incremental FP-Growth Mining Strategy for DynamicThreshold Value and Database Based on MapReduce. Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design, Page(s):271 – 276.

簡易檢索 / 詳目顯示

相關論文