簡易檢索 / 詳目顯示

研究生: 汪志軒
Chih-Hsuan Wang
論文名稱: 基於MapReduce計算架構的巨量資料高頻項目探勘演算法
A MapReduce Based Frequent Itemset Mining Algorithm for Big Data
指導教授: 呂永和
Yung-Ho Leu
口試委員: 楊維寧
Wei-Ning Yang
陳雲岫
Yun-Shiow Chen
學位類別: 碩士
Master
系所名稱: 管理學院 - 資訊管理系
Department of Information Management
論文出版年: 2015
畢業學年度: 103
語文別: 中文
論文頁數: 45
中文關鍵詞: MapReduceHadoop高頻項目組巨量資料負載平衡
外文關鍵詞: MapReduce, Hadoop, Frequent Itemset Mining, Load Balancing, Big Data
相關次數: 點閱:193下載:5
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 高頻項目探勘(Frequent Itemset Mining)演算法,與現今許多成功的商業應用案例息息相關,在現代的研究中佔有一席之地。Hadoop的MapReduce計算架構在可擴展性(Scalability)與容錯性(Fault Tolerance)上表現的相當出色。而在高頻項目探勘平行化的研究中,Mahout 的Parallel FP-Growth演算法運用MapReduce計算架構,具有良好的運算效能及可擴展性(Scalability)。
    因此,本研究在MapReduce計算架構上提出一個平行化且高效率的高頻項目探勘(Frequent Itemset Mining)演算法,改善既有演算法Mahout 的Parallel FP-Growth(PFPGrowth)中負載平衡的問題,加入二次子任務切割,有效提升其運算效率。經由多個資料集的實驗顯示,我們所提的方法有效地降低了資料探勘所需的時間。


    Frequent itemset mining is an important tool for data mining. There are already many successful applications of frequent itemset mining tool in today's business. Due to the advent of big data, traditional data mining algorithms are inadequate in terms of the time required in data mining. Fortunately, the Hadoop MapReduce computing framework offers a way to design a scalable and fault tolerant algorithm to tackle the problem of long computing time in mining a big dataset. The parallel FP-Growth algorithm in Mahout is a fast algorithm based on MapReduce computing framework. In this study, we proposed a load balancing method to improve the performance of the Mahout parallel FP-Growth algorithm. By further dividing the conditional databases of frequent 1-itemsets into smaller conditional databases, the proposed method significantly reduce the execution times of many example datasets.

    摘要 I Abstract II 目錄 III 圖目錄 V 表目錄 VII 第一章 緒論 1 1-1 研究背景 1 1-2 研究動機 6 1-3 研究目的 6 1-4 研究貢獻 6 1-5 研究架構 7 第二章 文獻探討 8 2-1 Apriori平行化研究 8 2-2 FPGrowth平行化研究 8 2-3 遞迴式FP-Growth平行化研究 9 2-4 相關研究小結 13 第三章 研究方法 14 3-1 演算法架構 14 3-2計算頻繁項目 15 3-3依照項目與其Prefix Path作切割分配 16 3-4平行執行在各節點中主機的FP-Growth運算 16 3-5彙整各節點的結果 17 3-6負載平衡之設計與說明 17 3-7二次切割負載平衡 18 第四章 實驗結果與分析 20 4-1 測試環境 20 4-2實驗結果分析 21 4-2-1 accidents.dat在各支持度下效能及負載平衡比較 21 4-2-2 retail.dat在各支持度下效能及負載平衡比較 25 4-2-3 T10I4D100K.dat在各支持度下效能及負載平衡比較 28 4-2-4 T40I10D100K.dat在各支持度下效能及負載平衡比較 31 4-2-5 二次切割在各支持度下效能及負載平衡比較 34 4-3實驗結果小結 42 第五章 結論 43 參考文獻 44

    [1]Zahra Farzanyar,Nick Cercone,”Accelerating Frequent Itemsets Mining on the Cloud: A MapReduce -Based Approach,” IEEE 13th International Conference on Data Mining Workshops (ICDMW), 2013.
    [2]Qiu LU,Xiao-hui CHENG,”The Research of Decision Tree Mining Based on Hadoop,” IEEE 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2012.
    [3]John Shafer,Rakeeh Agrawal,Manish Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996
    [4]Avita Katal,Mohammad Wazid,R H Goudar, “Big Data: Issues, Challenges, Tools and Good Practices , ” IEEE, Sixth International Conference on Contemporary Computing (IC3), 2013.
    [5]Guanghui Xu, Feng Xu, Hongxu Ma, “Deploying and Researching Hadoop in Virtual Machines,” IEEE International Conference on Automation and Logistics Zhengzhou, China, August 2012.
    [6] Sheela Gole, Bharat Tidke, “ Frequent Itemset Mining for Big Data in social media using ClustBigFIM algorithm ,” International Conference on Pervasive Computing (ICPC), 2015.
    [7] Hui Chen, Tsau Young Lin, Zhibing Zhang, Jie Zhong, “Parallel Mining Frequent Patterns over Big Transactional Data in Extended MapReduce ,” IEEE International Conference on Granular Computing (GrC), 2013.
    [8]Dawen Xia1,Yanhui Zhou, Zhuobo Rong, Zili Zhang, “IPFP: An Improved Parallel FP-Growth Algorithm for Frequent ItemsetsMining,”第五十九屆世界統計大會, 2013.
    [9]“Mahout源码分析:并行化FP-Growth算法 ” Mark Lin. N.p., n.d. Web. 21 July 2015.
    [10] “华夏35度” FP-Tree算法的实现. N.p., n.d. Web. 21 July 2015.
    [11]“Mahout Pfpgrowth - GrepCode.com - Java Source Code Search 2.0.” GrepCode.com - Java Source Code Search 2.0. N.p., n.d. Web. 21 July 2015.
    [12] “Frequent Itemset Mining Dataset Repository.” Frequent Itemset Mining Dataset Repository. N.p., n.d. Web. 21 July 2015.
    [13] “Inside 硬塞的網路趨勢觀察 ” Inside RSS. N.p., n.d. Web. 21 July 2015.
    [14] “一亩三分地” Mahout安装(Hadoop 1.2.1 版本). N.p., n.d. Web. 21 July 2015.
    [15] “Wu Peng Ta's BLOG” : Eclipse開發MapReduce程式(1). N.p., n.d. Web. 21 July 2015.
    [16]Wang, Jie, and Yu Zeng. “The Optimization of Parallel Frequent Pattern Growth Algorithm Based on Mahout in Cloud Manufacturing Environment, ” Seventh International Symposium on Computational Intelligence and Design , 2014.
    [17] Pramudiono, Iko, and Masaru Kitsuregawa. “Parallel FP-growth on PC cluster, ” Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2003. 467-473.

    QR CODE