在Spark叢集運算框架下採用二次切割的頻繁項目集演算法

簡易檢索 / 詳目顯示

回結果列表

研究生：	戴佑軒 Yu-Hsuan Tai
論文名稱：	在Spark叢集運算框架下採用二次切割的頻繁項目集演算法 A Two-Level Partition Based Frequent Itemset Mining Algorithm in Spark
指導教授：	呂永和 Yung-Ho Leu
口試委員:	楊維寧 Wei-Ning Yang 陳雲岫 Yun-Shiow Chen
學位類別：	碩士 Master
系所名稱：	管理學院 - 資訊管理系 Department of Information Management
論文出版年：	2019
畢業學年度：	107
語文別：	中文
論文頁數：	94
中文關鍵詞：	大數據、資料探勘、關聯規則、頻繁項目、FP-Growth 、FP-Tree 、Spark 、MLlib 、MapReduce
外文關鍵詞：	Big Data, Data Mining, Association Rule, Frequent Pattern, FP-Growth, FP-Tree, Spark, MLlib, MapReduce
相關次數：	點閱：302 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

　　我們正處於資訊爆炸的時代，「大數據(BigData)」相關應用議題已成顯學，掌握數據，可以洞悉過去、展望未來，藉以暸解發展的趨勢、把握可能的商機，是無論任何領域，都不能錯過的一門新學問，「關聯規則」是大數據分析裡重要的一部分，在商業決策、文字探勘、資訊安全等應用上頗具潛力，而頻繁項目集問題作為關聯性分析的基石，如何更準確、有效率的挖掘頻繁項目集，自然成為日益受到重視的議題。

　　FP-Gorwth演算法是頻繁項目集挖掘的經典演算法之一，透過建立FP-Tree的方式，在不產生候選項目的情況下產生頻繁項目集，在Apache Spark這個開源叢集運算框架中也內建了一套採用MapReduce架構的平行化的FP-Growth演算法供數據分析時使用，透過切割資料集為任務獨立子資料集的方式，使FP-Tree的挖掘可交由多個運算節點平行處理，大幅提升了大型資料集的頻繁項目挖掘效率。

　　本論文針對平行運算在面對資料傾斜問題可能造成的瓶頸，採用子資料集二次切割的概念，針對挖掘頻繁項目集過程中，可能需要較長運算時間的子資料集進行再切割，對Apache Spark的平行化FP-Growth演算法進行改進，經由多個資料集實驗後的結果顯示，此方案在較低最小支持度的條件下，可更充分的發揮叢集運算的優勢，取得效率上的改進。

We live in an era of information explosion. Topics related to Big Data and its applications have been of broad and current interest. Through data, we can both look back in time and look into the future, understanding trends and grasping possible business opportunities. Association rule mining plays an important role in big data analytics and has potential applications in business decision making, text mining and information security. Since the frequent itemset mining problem is the foundations of association rule analysis, how to find frequent itemsets efficiently has become an important issue.

The FP-Growth algorithm is one of the fast algorithms for frequent itemset mining. Through the construction of an FP-Tree, the algorithm generates frequent itemsets without generating candidate itemsets. A built-in parallel version of FP-Growth algorithm is provided in Apache Spark which is an open-source cloud computing framework. By dividing datasets into task independent sub-datasets, the mining task can be processed by multiple computational nodes in parallel.

In this thesis, we propose an improvement on the parallel FP-Growth algorithm in Apache Spark to resolve the possible bottleneck in parallel processing caused by data skew. The solution adopts the concept of re-partition of sub-datasets which might take longer time in frequent itemset mining. Experimental results on multiple datasets showed that our method can resolve the bottleneck in parallel frequent itemset mining so as to reduce the total computation time for frequent itemset mining with small minimum supports.

摘　要    I
ABSTRACT    II
誌　謝    III
目　錄    IV
圖目錄    VI
表目錄    VIII
第一張　緒論    1
1　研究背景及動機    1
2　研究目的    1
3　研究挑戰    2
4　研究限制    2
5　研究貢獻    2
6　論文架構    3
第二章　文獻探討    4
1　APACHE HADOOP    4
2　APACHE SPARK    6
3　SCALA    8
4　FP-GROWTH演算法    9
4.1　FP-Growth演算法流程    10
6　PFP : FP-GROWTH的平行化    19
6.1　Parallel FP-Growth演算法流程    21
第三章 研究方法    30
1　研究說明    30
2　使用資料集    30
3　二次切割    30
4　原版SPARK MLLIB PARALLEL FP-GROWTH    32
4.1　原版Spark MLlib Parallel FP-Growth流程    34
第四章　方法實作與實驗結果    42
1　環境建置    42
1.1　電腦硬體資源配置    43
1.2　Spark配置參數    44
2　方法實作    45
3套用二次切割於SPARK MLLIB FP-GROWTH    46
3.1二次切割版Spark MLlib Parallel FP-Growth流程    47
4實驗結果與分析    60
4.1 Mushroom資料集    61
4.2 Chess資料集    62
4.3 Connect資料集    63
4.3 Kosarak資料集    64
4.4 T10I4D100K資料集    65
4.5 資料傾斜現象觀察    66
第五章　結論與未來展望    69
1 實驗結論    69
2 未來展望    69
參考文獻    70
附錄　程式碼    71
                                

[1] Daniel J. Power (2002) DSS News, A Bi-Weekly Publication of DSSResources.com ( http://www.dssresources.com/newsletters/66.php )
[2] Chih-Cheng Liang (2016) 莫再提了！啤酒尿布是都市傳說
( https://medium.com/@chihchengliang/67b96e01b3f0 )
[3] Hadoop Ecosystem: an Integrated Environment for Big Data ( http://blog.agroknow.com/?p=3810 )
[4] 林大貴 (2016) Python+Spark 2.0+Hadoop機器學習與大數據分析實戰
[5] SPARK FOR BIG DATA ANALYTICS
( https://www.jenunderwood.com/spark-big-data-analytics-part-1/ )
[6] Martin Odersky , Lex Spoon, Bill Venners (2016) Programming in Scala, Updated for Scala 2.12.
[7] Scala – Wikipedia
( https://en.wikipedia.org/wiki/Scala_(programming_language) )
[8] Jiawei Han, Jian Pei, and Yiwen Yin. (2000) Mining Frequent Patterns without Candidate Generation
[9] Haoyuan Li et al. (2008) PFP: Parallel FP-Growth for Query Recommendation
[10] Frequent Itemset Mining Implementations Repository
( http://fimi.uantwerpen.be/ )
[11] 黃俊宏 (2018) Multi-level dataset decomposition for parallel frequent itemset mining on a cluster of personal computers
[12] Christian Borgelt (2015) Christian Borgelt's Web Pages
( http://www.borgelt.net/ )
[13] Download Apache Spark ( https://spark.apache.org/downloads.html )
[14] Building Apache Spark
( https://spark.apache.org/docs/latest/building-spark.html )
[15] Bart Goethals and Mohammed J. Zaki (2003) Advances in Frequent Itemset Mining Implementations: Report on FIMI’03

簡易檢索 / 詳目顯示

相關論文