在MapReduce中分解式頻繁項目集探勘之研究｜國立臺灣科技大學博碩士論文系統

簡易檢索 / 詳目顯示

回結果列表

研究生：	王浩宇 Hao-Yu Wang
論文名稱：	在MapReduce中分解式頻繁項目集探勘之研究 Decomposition-Based Algorithms for Frequent Itemset Mining on MapReduce
指導教授：	陳省隆 Hsing-Lung Chen
口試委員:	陳省隆 Hsing-Lung Chen 呂政修 Jenq-Shiou Leu 陳郁堂 Yie-Tarng Chen 莊博任 Po-Jen Chuang
學位類別：	碩士 Master
系所名稱：	電資學院 - 電子工程系 Department of Electronic and Computer Engineering
論文出版年：	2021
畢業學年度：	109
語文別：	中文
論文頁數：	80
中文關鍵詞：	大數據、列舉、頻繁項目集探勘、mapreduce 、平行計算
外文關鍵詞：	big data, enumeration, frequent itemset mining, MapReduce, parallel computing
相關次數：	點閱：186 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

對於關聯規則探勘及序列樣式探勘等領域，頻繁項目集探勘是其重要的一部分。類似Apriori的平行頻繁項目集探勘演算法須要掃描資料庫多次及產生巨量候選項目集以計數其頻率，因此飽受高I/O流量及巨大記憶體使用量之苦。類似Dist-Eclat的平行頻繁項目集探勘演算法可以解決上述問題，然而對於巨量資料庫它也無法在記憶體中進行處理。
本論文提出兩個在MapReduce中分解式頻繁項目集探勘機制，以有效處理大數據的頻繁項目集探勘。
對於Dec_Bottom_Up，先將交易紀錄視為一個最小點為空集合的交易子超立方體，此交易紀錄對於交易子超立方體中的任一項目集的貢獻頻率都相同。將子超立方體切割成不重複且完整的子超立方體，藉由整合相同子超立方體的最小點之出現頻率，通過閥值的最小點即為本階段的頻繁項目集，若最小點通過閥值，再將其相同的子超立方體整合，做為下一次探勘頻繁項目集的輸入資料，重複以上步驟，直到沒有新的頻繁項目集產生，即完成探勘頻繁項目集。
對於Dec_Top_Down，本論文提出交易子超立方體概念，藉由垂直交集法由上而下快速找出所有基礎封閉頻繁項目集及其出現頻率，並使用負向表列的子超立方體表示法快速將封閉頻繁項目集轉成頻繁項目集。因此本論文的分解式頻繁項目集探勘機制能高效平行處理，大大提升系統效能。

Frequent itemset mining (FIM) is a critical problem in association rule mining(ARM), sequence mining, and the like. Apriori-like parallel FIM algorithms has to scan a database multiple times and to generate an excessive number of candidate itemsets for counting the frequencies of all the candidate itemsets, resulting in suffering potential problems of high I/O traffic and a huge amounts of memory usage. FP-growth-like parallel FIM algorithms have addressed the scalability problem. However, they are incapable of constructing in-memory FP trees to accommodate large-scale databases.
The aim of this proposal is to propose two decomposition-based algorithms for frequent itemset mining on MapReduce.
For Dec_Bottom_up, a transaction is regarded as a subcube with empty min-node, because the transaction contributes its frequency to all the itemsets of the subcube. At first, each subcube is decomposed into a set of complete and disjoint subcubes. Then, the frequency of each min-node is aggregated such that min-node with frequency above the support threshold are regarded as frequent itemsets. Each subcube with frequent min-node is aggregated and is referred to as the input of next stage. Repeat the above step until no new frequent itemsets is generated.

For Dec_Top_Down, the concept of transaction subcube is introduced such that all the prime closed frequent itemsets and their frequencies can be derived swiftly in top-down manner by employing vertical intersection method. The closed frequent itemsets can be translated into frequency itemsets quickly in top-down traveling manner with the aid of negative form of subcubes. Thus, the proposed decomposition-based algorithms can have the capability of highly parallel processing, resulting in speeding up the system performance significantly.

目錄
致謝    2
摘要    3
ABSTRACT    5
Chapter 1    緒論    12
1.1    研究背景    12
1.2    研究應用    13
1.3    研究目的    14
Chapter 2    相關研究    16
2.1    Hadoop架構    16
2.1.1    MapReduce    17
2.1.2    Hadoop MapReduce執行過程    17
2.2    文獻探討    19
2.2.1    Apriori演算法    19
2.2.2    FIUT演算法    21
2.2.3    FP-growth演算法    23
2.2.4    Eclat演算法    26
Chapter 3    演算法設計架構    28
3.1    Dec_Bottom_up演算法    28
3.1.1    超立方體表示法(Subcubes Representation)    28
3.1.2    Dec_Bottom_up處理流程    31
3.2    Dec_Top_Down演算法    37
3.2.1    封閉頻繁項目集(Closed Frequent Itemset)    37
3.2.2    基礎封閉頻繁項目集(Prime Closed Frequent Itemset)    38
3.2.3    交易子超立方體(Transaction Subcube)    38
3.2.4    虛擬子超立方體(Virtual Subcube)    38
3.2.5    子超立方體之最大點與基礎封閉頻繁項目集之關係    40
3.2.6    Dec_Top_Down設計概念    41
3.2.7    垂直交集法(Vertical Intersection)    44
3.2.8    Dec_Top_Down處理流程    47
Chapter 4    實驗與分析    59
4.1    實驗環境    59
4.2    實驗輸入資料    59
4.3    演算法細部耗時分析    60
4.3.1    Dist-Eclat演算法    60
4.3.2    Dec_Bottom_Up演算法    61
4.3.3    Dec_Top_Down演算法    63
4.4    實驗結果與分析    65
Chapter 5    結論與未來展望    76
參考文獻    79

圖表目錄
圖2- 1 Hadoop 核心架構    16
圖2- 2 Hadoop MapReduce執行過程    17
圖2- 3 Apriori Property    19
圖2- 4 Apriori Pruning Principle    20
圖2- 5 FIUT演算法交易記錄整理舉例    21
圖2- 6 FIU樹提取頻繁項目集(最小支持度為3)    22
圖2- 7提取條件頻繁樣式樹遞迴關係    24
圖2- 8 Pfp演算法舉例(最小支持度為3)    25
圖2- 9資料庫垂直表示法示意圖    26
圖3- 1交易紀錄{a, b, c}及其子集合    28
圖3- 2 Bottom up搜索空間切割式意圖    30
圖3- 3將交易記錄t進行預先處理並轉換成超立方體表示法    34
圖3- 4探勘頻繁k項目集    36
圖3- 5 交易子超立方體交集產生虛擬子超立方體    39
圖3- 6垂直表示法    44
圖3- 7垂直交集法舉例    46
圖3- 8將交易紀錄t進行預先處理並轉成垂直表示法    48
圖3- 9找出所有子超立方體    51
圖3- 10求得封閉頻繁項目集並轉換成頻繁項目集    53
圖3- 11負向表示的子超立方體表示法空間切割示意圖    55
圖4- 1 Dist-Eclat演算法各部分執行耗時情況    61
圖4- 2 Dec_Bottom_Up演算法各部分執行耗時情況    62
圖4- 3 Dec_Top_Down演算法各部分執行耗時情況    64
圖4- 4各個演算法在Dataset 1的運算時間    66
圖4- 5 Dist-Eclat候選頻繁項目集Dataset 1輸出情況    67
圖4- 6 Dec_Bottom_Up子超立方體Dataset 1輸出情況    67
圖4- 7 Dec_Top_Down基礎頻繁項目集Dataset 1輸出情況    67
圖4- 8各個演算法在Dataset 2的運算時間    69
圖4- 9 Dist-Eclat候選頻繁項目集Dataset 2輸出情況    69
圖4- 10 Dec_Bottom_Up子超立方體Dataset 2輸出情況    70
圖4- 11 Dec_Top_Down基礎頻繁項目集Dataset 2輸出情況    70
圖4- 12各個演算法在Dataset 3的運算時間    71
圖4- 13 Dist-Eclat候選頻繁項目集Dataset 3輸出情況    72
圖4- 14 Dec_Bottom_Up子超立方體Dataset 3輸出情況    72
圖4- 15 Dec_Top_Down基礎頻繁項目集Dataset 3輸出情況    73
圖4- 16各個演算法在Dataset 4的運算時間    74
圖4- 17 Dist-Eclat候選頻繁項目集Dataset 4輸出情況    74
圖4- 18 Dec_Bottom_Up子超立方體Dataset 4輸出情況    75
圖4- 19 Dec_Top_Down基礎頻繁項目集Dataset 4輸出情況    75
表4- 1實驗環境    59
表4- 2實驗輸入資料參數    60


                                

參考文獻
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int. Conf. Very Large Data Bases VLDB, vol. 1215, 08/11 2000.
[2] R. Hernández-León, J. Hernández-Palancar, J. Carrasco-Ochoa, and J. F. Martínez-Trinidad, "Algorithms for mining frequent itemsets in static and dynamic datasets," Intell. Data Anal., vol. 14, pp. 419-435, 01/01 2010.
[3] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Chang, Pfp: parallel fp-growth for query recommendation. 2008, pp. 107-114.
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[5] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in 2010 IEEE 26th symposium on mass storage systems and technologies (MSST), 2010, pp. 1-10: Ieee.
[6] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, "Apriori-based frequent itemset mining algorithms on MapReduce," in Proceedings of the 6th international conference on ubiquitous information management and communication, 2012, pp. 1-8.
[7] Y.-J. Tsay, T.-J. Hsu, and J.-R. Yu, "FIUT: A new method for mining frequent itemsets," Information Sciences, vol. 179, pp. 1724-1737, 05/13 2009.
[8] Y. Xun, J. Zhang, and X. Qin, "Fidoop: Parallel mining of frequent itemsets using mapreduce," IEEE transactions on Systems, Man, and Cybernetics: systems, vol. 46, no. 3, pp. 313-325, 2015.
[9] I. Pramudiono and M. Kitsuregawa, FP-tax: tree structure based generalized association rule mining. 2004, pp. 60-63.
[10] Y. Xun, J. Zhang, X. Qin, and X. Zhao, "FiDoop-DP: Data partitioning in frequent itemset mining on hadoop clusters," IEEE Transactions on parallel and distributed systems, vol. 28, no. 1, pp. 101-114, 2016.
[11] M. J. Zaki, "Scalable algorithms for association mining," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372-390, 2000.
[12] S. Moens, E. Aksehirli, and B. Goethals, "Frequent itemset mining for big data," in 2013 IEEE International Conference on Big Data, 2013, pp. 111-118: IEEE.
[13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," in International Conference on Database Theory, 1999, pp. 398-416: Springer.
[14] IBM. Intelligent information systems. Internet page. Available: http://www.almaden.ibm.com/software/quest/resources/.

簡易檢索 / 詳目顯示

相關論文