研究生: |
王浩宇 Hao-Yu Wang |
---|---|
論文名稱: |
在MapReduce中分解式頻繁項目集探勘之研究 Decomposition-Based Algorithms for Frequent Itemset Mining on MapReduce |
指導教授: |
陳省隆
Hsing-Lung Chen |
口試委員: |
陳省隆
Hsing-Lung Chen 呂政修 Jenq-Shiou Leu 陳郁堂 Yie-Tarng Chen 莊博任 Po-Jen Chuang |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 電子工程系 Department of Electronic and Computer Engineering |
論文出版年: | 2021 |
畢業學年度: | 109 |
語文別: | 中文 |
論文頁數: | 80 |
中文關鍵詞: | 大數據 、列舉 、頻繁項目集探勘 、mapreduce 、平行計算 |
外文關鍵詞: | big data, enumeration, frequent itemset mining, MapReduce, parallel computing |
相關次數: | 點閱:186 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
對於關聯規則探勘及序列樣式探勘等領域,頻繁項目集探勘是其重要的一部分。類似Apriori的平行頻繁項目集探勘演算法須要掃描資料庫多次及產生巨量候選項目集以計數其頻率,因此飽受高I/O流量及巨大記憶體使用量之苦。類似Dist-Eclat的平行頻繁項目集探勘演算法可以解決上述問題,然而對於巨量資料庫它也無法在記憶體中進行處理。
本論文提出兩個在MapReduce中分解式頻繁項目集探勘機制,以有效處理大數據的頻繁項目集探勘。
對於Dec_Bottom_Up,先將交易紀錄視為一個最小點為空集合的交易子超立方體,此交易紀錄對於交易子超立方體中的任一項目集的貢獻頻率都相同。將子超立方體切割成不重複且完整的子超立方體,藉由整合相同子超立方體的最小點之出現頻率,通過閥值的最小點即為本階段的頻繁項目集,若最小點通過閥值,再將其相同的子超立方體整合,做為下一次探勘頻繁項目集的輸入資料,重複以上步驟,直到沒有新的頻繁項目集產生,即完成探勘頻繁項目集。
對於Dec_Top_Down,本論文提出交易子超立方體概念,藉由垂直交集法由上而下快速找出所有基礎封閉頻繁項目集及其出現頻率,並使用負向表列的子超立方體表示法快速將封閉頻繁項目集轉成頻繁項目集。因此本論文的分解式頻繁項目集探勘機制能高效平行處理,大大提升系統效能。
Frequent itemset mining (FIM) is a critical problem in association rule mining(ARM), sequence mining, and the like. Apriori-like parallel FIM algorithms has to scan a database multiple times and to generate an excessive number of candidate itemsets for counting the frequencies of all the candidate itemsets, resulting in suffering potential problems of high I/O traffic and a huge amounts of memory usage. FP-growth-like parallel FIM algorithms have addressed the scalability problem. However, they are incapable of constructing in-memory FP trees to accommodate large-scale databases.
The aim of this proposal is to propose two decomposition-based algorithms for frequent itemset mining on MapReduce.
For Dec_Bottom_up, a transaction is regarded as a subcube with empty min-node, because the transaction contributes its frequency to all the itemsets of the subcube. At first, each subcube is decomposed into a set of complete and disjoint subcubes. Then, the frequency of each min-node is aggregated such that min-node with frequency above the support threshold are regarded as frequent itemsets. Each subcube with frequent min-node is aggregated and is referred to as the input of next stage. Repeat the above step until no new frequent itemsets is generated.
For Dec_Top_Down, the concept of transaction subcube is introduced such that all the prime closed frequent itemsets and their frequencies can be derived swiftly in top-down manner by employing vertical intersection method. The closed frequent itemsets can be translated into frequency itemsets quickly in top-down traveling manner with the aid of negative form of subcubes. Thus, the proposed decomposition-based algorithms can have the capability of highly parallel processing, resulting in speeding up the system performance significantly.
參考文獻
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int. Conf. Very Large Data Bases VLDB, vol. 1215, 08/11 2000.
[2] R. Hernández-León, J. Hernández-Palancar, J. Carrasco-Ochoa, and J. F. Martínez-Trinidad, "Algorithms for mining frequent itemsets in static and dynamic datasets," Intell. Data Anal., vol. 14, pp. 419-435, 01/01 2010.
[3] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Chang, Pfp: parallel fp-growth for query recommendation. 2008, pp. 107-114.
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[5] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in 2010 IEEE 26th symposium on mass storage systems and technologies (MSST), 2010, pp. 1-10: Ieee.
[6] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, "Apriori-based frequent itemset mining algorithms on MapReduce," in Proceedings of the 6th international conference on ubiquitous information management and communication, 2012, pp. 1-8.
[7] Y.-J. Tsay, T.-J. Hsu, and J.-R. Yu, "FIUT: A new method for mining frequent itemsets," Information Sciences, vol. 179, pp. 1724-1737, 05/13 2009.
[8] Y. Xun, J. Zhang, and X. Qin, "Fidoop: Parallel mining of frequent itemsets using mapreduce," IEEE transactions on Systems, Man, and Cybernetics: systems, vol. 46, no. 3, pp. 313-325, 2015.
[9] I. Pramudiono and M. Kitsuregawa, FP-tax: tree structure based generalized association rule mining. 2004, pp. 60-63.
[10] Y. Xun, J. Zhang, X. Qin, and X. Zhao, "FiDoop-DP: Data partitioning in frequent itemset mining on hadoop clusters," IEEE Transactions on parallel and distributed systems, vol. 28, no. 1, pp. 101-114, 2016.
[11] M. J. Zaki, "Scalable algorithms for association mining," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372-390, 2000.
[12] S. Moens, E. Aksehirli, and B. Goethals, "Frequent itemset mining for big data," in 2013 IEEE International Conference on Big Data, 2013, pp. 111-118: IEEE.
[13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," in International Conference on Database Theory, 1999, pp. 398-416: Springer.
[14] IBM. Intelligent information systems. Internet page. Available: http://www.almaden.ibm.com/software/quest/resources/.