簡易檢索 / 詳目顯示

研究生: 黃楷鈞
Kai-Chun Huang
論文名稱: 於 Hadoop Yarn 平台上建置平行化的 FP-Growth 演算法
An Implementation of the parallelized FP-Growth Algorithm in the Hadoop Yarn Platform
指導教授: 呂永和
Yungho Leu
口試委員: 楊維寧
Wei-Ning Yang
陳雲岫
Yun-Shiow Chen
學位類別: 碩士
Master
系所名稱: 管理學院 - 資訊管理系
Department of Information Management
論文出版年: 2016
畢業學年度: 104
語文別: 中文
論文頁數: 58
中文關鍵詞: 巨量資料集高頻項目集探勘平行化架構Hadoop
外文關鍵詞: Big Datasets, Frequent Itemset Mining, Parallel Architecture, Hadoop
相關次數: 點閱:204下載:4
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 當今網路與資訊爆炸的時代,資料透過網路產生的速度更是呈指數性的成長,
    將資料分析成為有價值的資訊是現今大家所關注的議題之一。Hadoop 為當今資
    料探勘的熱門平台。Hadoop 的優點是採用分散式的檔案管理,能將大資料集切
    割成資料片段,動態地分配給不同的節點。搭配 Hadoop 的 MapReduce 的程式
    寫法,將一個工作切為數個子工作,交付不同的節點處理,以加速分析處理問題
    的效能;因此,Hadoop 平台適合處理巨量資料集。 FP-Growth 演算法為不產生
    候選項目集的高頻項目集探勘演算法,在許多領域中具有很高的實際應用價值。
    然而傳統的 FP-Growth 只能處理小資料集,在面對巨量資料集時,處理的時間與
    效能較不理想。本研究針對對傳統的 FP-Growth 演算法,利用 MapReduce 運算
    技術,將 FP-Growth 演算法平行化於 Hadoop Yarn 平台上。實驗結果發現新的
    Yarn 架構,在處理不同資料集時,比舊的 Hadoop 架構具有一定的優勢。


    Due to rapid advance of the Internet, data have been generated in an extremely fast speed. In an era of data explosion, people are eager to extract valuable information from large amount of data. Hadoop is a very popular framework for data mining on big datasets. In a Hadoop platform, one can splits a computing task into a set of subtasks to be processed by different computing nodes with the Map and Reduce operations. Furthermore, the Hadoop distributed file system, HDFS, is able to split a big dataset into file segments and to allocate them among different computing nodes of the platform. Therefore, Hadoop computing platform is well suited for data mining on big datasets. The FP-Growth is a famous frequent itemset mining algorithm with many realife applications. However, the traditional FP-Growth algorithm can only deal with small datasets. It is inadequate in handling big datasets. In thesis, we implemented a parallel FP-Growth algorithm in the new Hadoop Yarn platform. The experimental results showed that the new platform offered better performance on the FP-Growth algorithm than an older Hadoop platform did on many different datasets.

    摘要 I ABSTRACT II 目錄 III 圖目錄 V 表目錄 VI 第一章 緒論 1 1.1 研究背景 1 1.2 研究動機與目的 3 1.3 研究挑戰與限制 4 1.4 研究貢獻 5 1.5 研究架構 6 第二章 文獻探討 7 2.1 巨量資料 7 2.2 關聯規則 10 2.3 FP-Growth演算法 13 2.3.1 FP-Tree的建構方式 14 2.3.2 FP-Growth演算法 18 2.3.3 FP-Growth演算法的缺失 20 2.4平行化計算架構 21 2.4.1 Hadoop 22 2.4.2 MapReduce 25 第三章 研究方法 36 3.1 研究架構 36 3.2 研究方法 37 3.3 PFP演算法 38 第四章 方法實作與實驗結果 43 4.1 實驗環境建置 43 4.2 方法實作 44 4.3 PFP演算法於資料頻繁項目集之比較 46 第五章 結論 56 5.1 結論 56 參考文獻 57

    [1] Usama Fayyad, Gregory Piatetsky-Shapiro (1996 ). From Data Mining to Knowledge Discovery in Databases. Association for the Advancement of Artificial Intelligence. Pages 37-54.
    [2] Chih-Jen-Lin (2014). Big-data Analytics: Challenges and Opportunities, Department of Computer Science. National Taiwan University, August 30, 2014.
    [3] Hadoop. Retrieved from http://hadoop.apache.org/.
    [4] Apache Official Website. Retrieved from https://httpd.apache.org/.
    [5] Rakesh Agrawal (1993). SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data. Pages 207-216.
    [6] Jiawie Han (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data, Pages 1-12.
    [7] Rakesh Agrawal (1994). Fast Algorithms for Mining Association Rules in Large Databases. VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases, Pages 487-499.
    [8] Yanfei Guo (2013). iShuffle: Improving Hadoop Performance with Shuffle-on-Write. 10th International Conference on Autonomic Computing (ICAC ’13), Pages 107-117.
    [9] Big Data. Retrieved from http:// zh.wikipedia.org/wiki/bigdata.
    [10] Den Laney (2001). 3D data management: controlling data volume, velocity and variety. Application Delivery Strategies. META Group, (February 2001).
    [11] Kevin Normandeau, Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity. Big data innovation summit 2013, Boston, Sept. 12, 2013.
    [12] Paolo Giudici (2009). Market Basket Analysis. Applied Data Mining for Business and Industry, Second Edition.
    [13] QIAN Guang-chao (2008). One Optimized Method of Apriori Algorithm. Electronic Technology & Information Science, Shanghai, 2008.
    [14] Grid Computing. Retrieved from http://zh.wikipedia.org/wiki/gridcomputing.
    [15] Hadoop Ecosystem. Retrieved from http://www.cloudera.com/training/library/hadoop-ecosystem.html.
    [16] Hadoop Yarn. Retrieved from http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
    [17] Haoyuan Li (2007). PFP: Parallel FP-Growth for Query Recommendation. Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys 2008, Lausanne, Switzerland, October 23-25.
    [18] Le Zhou (2010). Balanced Parallel FP-Growth with MapReduce. Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference, Page(s):243 – 246.
    [19] Xiaoting Wei, Yunlong Ma (2014). Incremental FP-Growth Mining Strategy for DynamicThreshold Value and Database Based on MapReduce. Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design, Page(s):271 – 276.

    QR CODE