在Spark叢集運算框架中頻繁項目探勘演算法的效能比較｜國立臺灣科技大學博碩士論文系統

簡易檢索 / 詳目顯示

回結果列表

研究生：	沈尚宜 Shang-Yi Shen
論文名稱：	在Spark叢集運算框架中頻繁項目探勘演算法的效能比較 Performance comparison of different frequent itemset mining algorithms in Spark cluster computing framework
指導教授：	呂永和 Yung-Ho Leu
口試委員:	呂永和 Yung-Ho Leu 楊維寧 Wei-Ning Yang 陳雲岫 Yun-Shiow Chen
學位類別：	碩士 Master
系所名稱：	管理學院 - 資訊管理系 Department of Information Management
論文出版年：	2018
畢業學年度：	106
語文別：	中文
論文頁數：	97頁
中文關鍵詞：	高頻項目、FP-Growth 、資料探勘、Apriori 、Spark 、RDD
外文關鍵詞：	Frequent itemset, Data mining, FP-Growth, Apriori, Spark, RDD
相關次數：	點閱：228 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

隨著資訊科技的日新月異以及社群媒體的興起，電腦上所累積的資料量呈現指數性地成長，使得巨量資料、雲端運算、資料探勘等變成現今最熱門的議題。為了尋找交易資料集中資料項目之間的相關性，學者們提出了高頻項目集的探勘方法。由於一般的高頻項目集的探勘方法在處理巨量資料集的問題上效率不佳，學者便將高頻項目集探勘方法平行化；而Spark是目前平行化的最佳雲端計算平台，它的核心技術為可彈性分散式資料集(Resilient Distributed Dataset)，可將資料集切割成數個較小的資料集，並分散儲存於多個工作節點之中，藉由紀錄資料集的演變過程(Data Lineage)以及提供以主記憶體為主的平行運算，提供一個將演算法平行化的雲端高速計算環境。
在本研究中，我們將比較兩個主要的高頻項目集演算法FP-Growth與Apriori的平行化版本在於Spark環境中的執行效能比較，並從中探討在不同執行條件下的優劣。從實驗結果發現，由於讀取資料集次數的差異，使得FP-Growth不論在任何執行條件下，執行效率皆優於Apriori；但是在記憶體的使用上，由於FP-Growth必須產生FP-Tree，當數據越複雜時會造成FP-Tree相對複雜，因此在主記憶體上使用量較多。此外，從本研究的實驗結果亦發現，工作節點數越多運算速度就越快，由於負載平衡不一的關係，因此每增加一個節點執行速度只有加快約10%左右；我們同時發現，資料群集的分組數目與執行速度並無直接關係，而是取決於資料集的複雜程度。

關鍵字 : 高頻項目、資料探勘、FP-Growth、Apriori、Spark、RDD

With the rapid development of information technology and the prevalence of social media, the amount of data generated on the web has been increasing in a rapid speed. To support and leverage the data on the web, big data, cloud computing and data mining have become popular research issues recently. Among different data mining techniques, the frequent itemset mining algorithm is devised to find the correlations among different data items in a large transactional dataset. Since the traditional frequent itemset mining algorithms are inadequate for large datasets, researchers have parallelized the frequent itemset mining algorithms to reduce their execution times on large datasets.
Spark is a powerful cloud computing platform for parallel processing. With its resilient distributed dataset (RDD) technology, a large dataset can be partitioned into a set of finer sub-datasets which are stored in different worker nodes to promote parallel processing. By keeping the lineage of an RDD, Spark can re-build an RDD when the RDD is lost. With the fault tolerant capability, Spark allows the RDDs to reside in main memory to support in-memory computation rather than to load the RDDs from the disk when they are required. In this thesis, we compare the performance of two major frequent itemset mining algorithms, Apriori and FP-Growth, in the Spark computing platform. We compare the performance of the paralleled implementation of FP-Growth, the PFP, and the paralleled implementation of the Apriori in Spark under different parameter settings. The experimental results showed that the PFP outperformed the paralleled Apriori in all the different settings of the experimental parameters. As to the memory usage, the PFP required more memory than that of the paralleled Apriori when the transactional dataset is large. This is because the PFP needs to build FP-Trees in memory and the FP-Trees consume large amount of memory when the dataset is large. In addition, the experimental results also showed that the speed-ups of the execution with different numbers of worker nodes are not as good as expected. Due to the different load balancing, the execution time was reduced by only 10% as one extra worker node is added in the Spark. We also found that the number of partitions on the dataset has no significant effect on the execution time of the algorithms.
Keywords : Frequent itemset , Data mining , FP-Growth , Apriori , Spark , RDD

摘 要    I
ABSTRACT    II
誌 謝    IV
目 錄    V
圖目錄    VII
表目錄    X
第一章 緒論    1
1 研究背景    1
2 研究動機與目的    3
3 本文架構    4
第二章 文獻探討    5
1 大數據(big data)    5
1.1    大數據的挖掘    7
1.2    大數據的應用與挑戰    8
2 關聯規則    9
2.1 關聯規則背景介紹    9
2.2    關聯規則的定義    10
2.3    關聯規則的推導    11
3 Divide and Conquer    13
3.1 Divide and Conquer 遞迴問題    13
4 Hadoop    15
4.1 Hadoop HDFS分散式檔案系統    16
5 Apache Spark    18
5.1 Spark主要功能    19
5.2 Spark數據操作方式    20
6 Apriori演算法    23
6.1 Apriori介紹與探勘高頻項目方法    23
6.2 Parallel Apriori    30
6.3 Spark中建置Apriori演算法    31
7    FP-Growth演算法    32
7.1 FP-Tree建構與說明    32
7.2 FP-Growth探勘高頻項目方法    40
8 Spark中建置Parallel FP-Growth演算法    49
8.1 Prefix path分群    51
8.2 PFP-Growth建構方法    54
第三章 研究方法    57
1 比較種類與目的    57
2 比較方法    60
第四章 實驗結果與分析    64
1 實驗測試環境    64
2 Q-List消耗資源分析    65
3 PFP-Growth與Apriori之比較    67
3.1 FP-Growth與Apriori記憶體用量比較    68
3.2 PFP-Growth與Apriori於Spark中執行效能比較    70
3.3 Apriori中不同Iteration比較    71
4 PFP-Growth在不同參數下的執行效能比較    73
4.1 固定minsupport比較在不同partition之下的執行時間    73
4.2 在不同數量的節點上執行速度比較    76
5 FP-Growth與PFP-Growth效能比較    78
第五章 結論與未來方向    80
1 結論    80
2 未來方向    81
參考文獻    82






                                

[1] Jewel(2015)。巨量資料的時代，用「大、快、雜、疑」四字箴言帶你認識大數據。
[2] Titan(2014)。台大林智仁教授談大數據分析的挑戰與機會。
[3] 王金龍,銘傳大學學術副校長(2015.7)。銘傳Moodle大數據分析與學生學習成效。評鑑雙月刊第56期。
[4] 翁政雄(2011)。"從購買意願資料中挖掘高度相關性的關聯規則"。中臺科技大學資訊管理系研討會論文。
[5] Agrawal, R., T. Imielinski and A. Swami (1993) Mining association rules between sets of items in large database. Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C.
[6] Susan Li (2017).A Gentle Introduction on Market Basket Analysis - Association Rules. R news and tutorials contributed by (750) R bloggers.
[7] 巨量資料的基本概念,http://epaper.gotop.com.tw/PDFSample/AEY035100.pdf
[8] Hong, T.P., Kuo, C.S. and Chi, S.C＂Mining association rules from quantitative data,＂Intelligent Data Analysis (3:5), 1999, pp. 363-376.
[9] J. Zakir, T. Seymour and K. Berg. Big Data Analytics. Journal of Information Systems. Vol. 16, No. 2, 81--90, 2015.
[10]: A. M. Al-Salim, T. El-Gorashi, A. Q. Lawey, and J. M. Elmirghani, "Greening Big Data Networks: Velocity Impact," IET Optoelectronics. 2017.
[11]Luo et al. 2012, Dijun Luo, Chris Ding, Heng Huang, Parallelization with Multiplicative Algorithms for Big Data Mining, In: Proc. of IEEE 12th International Conference on Data Mining, pp.489-498, 2012.
[12] Z. C. Li, P. L. He and M. Lei, “A High Efficient AprioriTid Algorithm for Mining Association Rule, "Proceedings of the Fourth International Conference on Machine Learning and Cybernetics", Guangzhou, China, 2005.
[13] Performance comparison of Apriori a DANIEL HUNYADI Department of Computer Science ”Lucian Blaga” University of Sibiu, Romania nd FP-Growth algorithms in generating association rules.
[14]Han J., Pei J., Yin Y. and Mao R., Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery, 2003.
[15] J. Han J. Pei and Y. Yin. Minning frequent pattern without candidate generation. In Proc. 2000 ACMSIGMOD Int. Conf. on Management of Data. ACM, 2000
[16]Induction of Association Rules: Apriori Implementation Christian Borgelt and Rudolf Kruse Department of Knowledge Processing and Language Engineering
School of Computer Science Otto-von-Guericke-University of Magdeburg
University atsplatz 2, D-39106 Magdeburg, Germany.
[17] Li, Z. C., P. L. He and M. Lei (2005) A high efficient AprioriTid algorithm for mining association rule.Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, China.
[18] Data Mining: Concepts and Techniques, J. Han and M. Kimber, eds. Morgan Kaufman, 2000.
[19] 陳偉康(2014)。淺談「窮舉法」。香港教育學院數學與資訊科技學系。數學教育第三十六期。
[20] Apache Hadoop, https://baike.baidu.com/item/Apache%20Hadoop/18754243
[21] Li H, Wang Y, Zhang D et al. PFP: Parallel FP-Growth for query recommendation. In Proc. the ACM Conference on Recommender Systems, Oct. 2008, pp.107-114.
[22] QIU H,GU R,YUAN C,et a1.YAFIM:A parallel frequent itemset mining algorithm with Spark(c)//2014 IEEE International Parallel & Distributed Processing Symposium Work.Shops.(S.1.):IEEE,2014:1664-1671.
[23] Apache Spark –Introduction,
https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm
[24] R. Joy and K. K. Sherly, "Parallel frequent itemset mining with spark RDD framework for disease prediction," 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), Nagercoil, 2016, pp. 1-5.
[25] R. Mishra and A. Choubey, “Discovery of frequent patterns from web log data by using FP-growth algorithm for web usage mining,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, no. 9, pp. 311– 318, 2012.
[26] 方向，张功萱(2016)。基于Spark的PFP—Growth并行算法优化实现。南京理工大学计算机科学与技术学院，江苏南京。
[27] Yanjun(2014)。RDD:基於內存的集群計算容錯抽象。
[28] Li N，Zeng L，He Q，et al． Parallel Implementation of Apriori Algorithm Based on MapＲeduce / / Proc of the 13th ACIS International Conference on Software Engineering，Artificial Intelligence， Networking and Parallel /Distributed Computing． Kyoto，Japan， 2012: 191 － 200.
[29] 闫梦洁，罗军，刘建英侯传旺。 IABS:一个基于Spark的Apriori改进算法国防科学技术大学计算机学院，长沙。

全文公開日期 2023/08/09 (校內網路)
全文公開日期本全文未授權公開 (校外網路)
全文公開日期本全文未授權公開 (國家圖書館：臺灣博碩士論文系統)

簡易檢索 / 詳目顯示

相關論文