研究生: |
戴佑軒 Yu-Hsuan Tai |
---|---|
論文名稱: |
在Spark叢集運算框架下採用二次切割的頻繁項目集演算法 A Two-Level Partition Based Frequent Itemset Mining Algorithm in Spark |
指導教授: |
呂永和
Yung-Ho Leu |
口試委員: |
楊維寧
Wei-Ning Yang 陳雲岫 Yun-Shiow Chen |
學位類別: |
碩士 Master |
系所名稱: |
管理學院 - 資訊管理系 Department of Information Management |
論文出版年: | 2019 |
畢業學年度: | 107 |
語文別: | 中文 |
論文頁數: | 94 |
中文關鍵詞: | 大數據 、資料探勘 、關聯規則 、頻繁項目 、FP-Growth 、FP-Tree 、Spark 、MLlib 、MapReduce |
外文關鍵詞: | Big Data, Data Mining, Association Rule, Frequent Pattern, FP-Growth, FP-Tree, Spark, MLlib, MapReduce |
相關次數: | 點閱:302 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
我們正處於資訊爆炸的時代,「大數據(BigData)」相關應用議題已成顯學,掌握數據,可以洞悉過去、展望未來,藉以暸解發展的趨勢、把握可能的商機,是無論任何領域,都不能錯過的一門新學問,「關聯規則」是大數據分析裡重要的一部分,在商業決策、文字探勘、資訊安全等應用上頗具潛力,而頻繁項目集問題作為關聯性分析的基石,如何更準確、有效率的挖掘頻繁項目集,自然成為日益受到重視的議題。
FP-Gorwth演算法是頻繁項目集挖掘的經典演算法之一,透過建立FP-Tree的方式,在不產生候選項目的情況下產生頻繁項目集,在Apache Spark這個開源叢集運算框架中也內建了一套採用MapReduce架構的平行化的FP-Growth演算法供數據分析時使用,透過切割資料集為任務獨立子資料集的方式,使FP-Tree的挖掘可交由多個運算節點平行處理,大幅提升了大型資料集的頻繁項目挖掘效率。
本論文針對平行運算在面對資料傾斜問題可能造成的瓶頸,採用子資料集二次切割的概念,針對挖掘頻繁項目集過程中,可能需要較長運算時間的子資料集進行再切割,對Apache Spark的平行化FP-Growth演算法進行改進,經由多個資料集實驗後的結果顯示,此方案在較低最小支持度的條件下,可更充分的發揮叢集運算的優勢,取得效率上的改進。
We live in an era of information explosion. Topics related to Big Data and its applications have been of broad and current interest. Through data, we can both look back in time and look into the future, understanding trends and grasping possible business opportunities. Association rule mining plays an important role in big data analytics and has potential applications in business decision making, text mining and information security. Since the frequent itemset mining problem is the foundations of association rule analysis, how to find frequent itemsets efficiently has become an important issue.
The FP-Growth algorithm is one of the fast algorithms for frequent itemset mining. Through the construction of an FP-Tree, the algorithm generates frequent itemsets without generating candidate itemsets. A built-in parallel version of FP-Growth algorithm is provided in Apache Spark which is an open-source cloud computing framework. By dividing datasets into task independent sub-datasets, the mining task can be processed by multiple computational nodes in parallel.
In this thesis, we propose an improvement on the parallel FP-Growth algorithm in Apache Spark to resolve the possible bottleneck in parallel processing caused by data skew. The solution adopts the concept of re-partition of sub-datasets which might take longer time in frequent itemset mining. Experimental results on multiple datasets showed that our method can resolve the bottleneck in parallel frequent itemset mining so as to reduce the total computation time for frequent itemset mining with small minimum supports.
[1] Daniel J. Power (2002) DSS News, A Bi-Weekly Publication of DSSResources.com ( http://www.dssresources.com/newsletters/66.php )
[2] Chih-Cheng Liang (2016) 莫再提了!啤酒尿布是都市傳說
( https://medium.com/@chihchengliang/67b96e01b3f0 )
[3] Hadoop Ecosystem: an Integrated Environment for Big Data ( http://blog.agroknow.com/?p=3810 )
[4] 林大貴 (2016) Python+Spark 2.0+Hadoop機器學習與大數據分析實戰
[5] SPARK FOR BIG DATA ANALYTICS
( https://www.jenunderwood.com/spark-big-data-analytics-part-1/ )
[6] Martin Odersky , Lex Spoon, Bill Venners (2016) Programming in Scala, Updated for Scala 2.12.
[7] Scala – Wikipedia
( https://en.wikipedia.org/wiki/Scala_(programming_language) )
[8] Jiawei Han, Jian Pei, and Yiwen Yin. (2000) Mining Frequent Patterns without Candidate Generation
[9] Haoyuan Li et al. (2008) PFP: Parallel FP-Growth for Query Recommendation
[10] Frequent Itemset Mining Implementations Repository
( http://fimi.uantwerpen.be/ )
[11] 黃俊宏 (2018) Multi-level dataset decomposition for parallel frequent itemset mining on a cluster of personal computers
[12] Christian Borgelt (2015) Christian Borgelt's Web Pages
( http://www.borgelt.net/ )
[13] Download Apache Spark ( https://spark.apache.org/downloads.html )
[14] Building Apache Spark
( https://spark.apache.org/docs/latest/building-spark.html )
[15] Bart Goethals and Mohammed J. Zaki (2003) Advances in Frequent Itemset Mining Implementations: Report on FIMI’03