簡易檢索 / 詳目顯示

研究生: 王浩宇
Hao-Yu Wang
論文名稱: 在MapReduce中分解式頻繁項目集探勘之研究
Decomposition-Based Algorithms for Frequent Itemset Mining on MapReduce
指導教授: 陳省隆
Hsing-Lung Chen
口試委員: 陳省隆
Hsing-Lung Chen
呂政修
Jenq-Shiou Leu
陳郁堂
Yie-Tarng Chen
莊博任
Po-Jen Chuang
學位類別: 碩士
Master
系所名稱: 電資學院 - 電子工程系
Department of Electronic and Computer Engineering
論文出版年: 2021
畢業學年度: 109
語文別: 中文
論文頁數: 80
中文關鍵詞: 大數據列舉頻繁項目集探勘mapreduce平行計算
外文關鍵詞: big data, enumeration, frequent itemset mining, MapReduce, parallel computing
相關次數: 點閱:186下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 對於關聯規則探勘及序列樣式探勘等領域,頻繁項目集探勘是其重要的一部分。類似Apriori的平行頻繁項目集探勘演算法須要掃描資料庫多次及產生巨量候選項目集以計數其頻率,因此飽受高I/O流量及巨大記憶體使用量之苦。類似Dist-Eclat的平行頻繁項目集探勘演算法可以解決上述問題,然而對於巨量資料庫它也無法在記憶體中進行處理。
    本論文提出兩個在MapReduce中分解式頻繁項目集探勘機制,以有效處理大數據的頻繁項目集探勘。
    對於Dec_Bottom_Up,先將交易紀錄視為一個最小點為空集合的交易子超立方體,此交易紀錄對於交易子超立方體中的任一項目集的貢獻頻率都相同。將子超立方體切割成不重複且完整的子超立方體,藉由整合相同子超立方體的最小點之出現頻率,通過閥值的最小點即為本階段的頻繁項目集,若最小點通過閥值,再將其相同的子超立方體整合,做為下一次探勘頻繁項目集的輸入資料,重複以上步驟,直到沒有新的頻繁項目集產生,即完成探勘頻繁項目集。
    對於Dec_Top_Down,本論文提出交易子超立方體概念,藉由垂直交集法由上而下快速找出所有基礎封閉頻繁項目集及其出現頻率,並使用負向表列的子超立方體表示法快速將封閉頻繁項目集轉成頻繁項目集。因此本論文的分解式頻繁項目集探勘機制能高效平行處理,大大提升系統效能。


    Frequent itemset mining (FIM) is a critical problem in association rule mining(ARM), sequence mining, and the like. Apriori-like parallel FIM algorithms has to scan a database multiple times and to generate an excessive number of candidate itemsets for counting the frequencies of all the candidate itemsets, resulting in suffering potential problems of high I/O traffic and a huge amounts of memory usage. FP-growth-like parallel FIM algorithms have addressed the scalability problem. However, they are incapable of constructing in-memory FP trees to accommodate large-scale databases.
    The aim of this proposal is to propose two decomposition-based algorithms for frequent itemset mining on MapReduce.
    For Dec_Bottom_up, a transaction is regarded as a subcube with empty min-node, because the transaction contributes its frequency to all the itemsets of the subcube. At first, each subcube is decomposed into a set of complete and disjoint subcubes. Then, the frequency of each min-node is aggregated such that min-node with frequency above the support threshold are regarded as frequent itemsets. Each subcube with frequent min-node is aggregated and is referred to as the input of next stage. Repeat the above step until no new frequent itemsets is generated.

    For Dec_Top_Down, the concept of transaction subcube is introduced such that all the prime closed frequent itemsets and their frequencies can be derived swiftly in top-down manner by employing vertical intersection method. The closed frequent itemsets can be translated into frequency itemsets quickly in top-down traveling manner with the aid of negative form of subcubes. Thus, the proposed decomposition-based algorithms can have the capability of highly parallel processing, resulting in speeding up the system performance significantly.

    目錄 致謝 2 摘要 3 ABSTRACT 5 Chapter 1 緒論 12 1.1 研究背景 12 1.2 研究應用 13 1.3 研究目的 14 Chapter 2 相關研究 16 2.1 Hadoop架構 16 2.1.1 MapReduce 17 2.1.2 Hadoop MapReduce執行過程 17 2.2 文獻探討 19 2.2.1 Apriori演算法 19 2.2.2 FIUT演算法 21 2.2.3 FP-growth演算法 23 2.2.4 Eclat演算法 26 Chapter 3 演算法設計架構 28 3.1 Dec_Bottom_up演算法 28 3.1.1 超立方體表示法(Subcubes Representation) 28 3.1.2 Dec_Bottom_up處理流程 31 3.2 Dec_Top_Down演算法 37 3.2.1 封閉頻繁項目集(Closed Frequent Itemset) 37 3.2.2 基礎封閉頻繁項目集(Prime Closed Frequent Itemset) 38 3.2.3 交易子超立方體(Transaction Subcube) 38 3.2.4 虛擬子超立方體(Virtual Subcube) 38 3.2.5 子超立方體之最大點與基礎封閉頻繁項目集之關係 40 3.2.6 Dec_Top_Down設計概念 41 3.2.7 垂直交集法(Vertical Intersection) 44 3.2.8 Dec_Top_Down處理流程 47 Chapter 4 實驗與分析 59 4.1 實驗環境 59 4.2 實驗輸入資料 59 4.3 演算法細部耗時分析 60 4.3.1 Dist-Eclat演算法 60 4.3.2 Dec_Bottom_Up演算法 61 4.3.3 Dec_Top_Down演算法 63 4.4 實驗結果與分析 65 Chapter 5 結論與未來展望 76 參考文獻 79 圖表目錄 圖2- 1 Hadoop 核心架構 16 圖2- 2 Hadoop MapReduce執行過程 17 圖2- 3 Apriori Property 19 圖2- 4 Apriori Pruning Principle 20 圖2- 5 FIUT演算法交易記錄整理舉例 21 圖2- 6 FIU樹提取頻繁項目集(最小支持度為3) 22 圖2- 7提取條件頻繁樣式樹遞迴關係 24 圖2- 8 Pfp演算法舉例(最小支持度為3) 25 圖2- 9資料庫垂直表示法示意圖 26 圖3- 1交易紀錄{a, b, c}及其子集合 28 圖3- 2 Bottom up搜索空間切割式意圖 30 圖3- 3將交易記錄t進行預先處理並轉換成超立方體表示法 34 圖3- 4探勘頻繁k項目集 36 圖3- 5 交易子超立方體交集產生虛擬子超立方體 39 圖3- 6垂直表示法 44 圖3- 7垂直交集法舉例 46 圖3- 8將交易紀錄t進行預先處理並轉成垂直表示法 48 圖3- 9找出所有子超立方體 51 圖3- 10求得封閉頻繁項目集並轉換成頻繁項目集 53 圖3- 11負向表示的子超立方體表示法空間切割示意圖 55 圖4- 1 Dist-Eclat演算法各部分執行耗時情況 61 圖4- 2 Dec_Bottom_Up演算法各部分執行耗時情況 62 圖4- 3 Dec_Top_Down演算法各部分執行耗時情況 64 圖4- 4各個演算法在Dataset 1的運算時間 66 圖4- 5 Dist-Eclat候選頻繁項目集Dataset 1輸出情況 67 圖4- 6 Dec_Bottom_Up子超立方體Dataset 1輸出情況 67 圖4- 7 Dec_Top_Down基礎頻繁項目集Dataset 1輸出情況 67 圖4- 8各個演算法在Dataset 2的運算時間 69 圖4- 9 Dist-Eclat候選頻繁項目集Dataset 2輸出情況 69 圖4- 10 Dec_Bottom_Up子超立方體Dataset 2輸出情況 70 圖4- 11 Dec_Top_Down基礎頻繁項目集Dataset 2輸出情況 70 圖4- 12各個演算法在Dataset 3的運算時間 71 圖4- 13 Dist-Eclat候選頻繁項目集Dataset 3輸出情況 72 圖4- 14 Dec_Bottom_Up子超立方體Dataset 3輸出情況 72 圖4- 15 Dec_Top_Down基礎頻繁項目集Dataset 3輸出情況 73 圖4- 16各個演算法在Dataset 4的運算時間 74 圖4- 17 Dist-Eclat候選頻繁項目集Dataset 4輸出情況 74 圖4- 18 Dec_Bottom_Up子超立方體Dataset 4輸出情況 75 圖4- 19 Dec_Top_Down基礎頻繁項目集Dataset 4輸出情況 75 表4- 1實驗環境 59 表4- 2實驗輸入資料參數 60

    參考文獻
    [1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int. Conf. Very Large Data Bases VLDB, vol. 1215, 08/11 2000.
    [2] R. Hernández-León, J. Hernández-Palancar, J. Carrasco-Ochoa, and J. F. Martínez-Trinidad, "Algorithms for mining frequent itemsets in static and dynamic datasets," Intell. Data Anal., vol. 14, pp. 419-435, 01/01 2010.
    [3] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Chang, Pfp: parallel fp-growth for query recommendation. 2008, pp. 107-114.
    [4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
    [5] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The hadoop distributed file system," in 2010 IEEE 26th symposium on mass storage systems and technologies (MSST), 2010, pp. 1-10: Ieee.
    [6] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, "Apriori-based frequent itemset mining algorithms on MapReduce," in Proceedings of the 6th international conference on ubiquitous information management and communication, 2012, pp. 1-8.
    [7] Y.-J. Tsay, T.-J. Hsu, and J.-R. Yu, "FIUT: A new method for mining frequent itemsets," Information Sciences, vol. 179, pp. 1724-1737, 05/13 2009.
    [8] Y. Xun, J. Zhang, and X. Qin, "Fidoop: Parallel mining of frequent itemsets using mapreduce," IEEE transactions on Systems, Man, and Cybernetics: systems, vol. 46, no. 3, pp. 313-325, 2015.
    [9] I. Pramudiono and M. Kitsuregawa, FP-tax: tree structure based generalized association rule mining. 2004, pp. 60-63.
    [10] Y. Xun, J. Zhang, X. Qin, and X. Zhao, "FiDoop-DP: Data partitioning in frequent itemset mining on hadoop clusters," IEEE Transactions on parallel and distributed systems, vol. 28, no. 1, pp. 101-114, 2016.
    [11] M. J. Zaki, "Scalable algorithms for association mining," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372-390, 2000.
    [12] S. Moens, E. Aksehirli, and B. Goethals, "Frequent itemset mining for big data," in 2013 IEEE International Conference on Big Data, 2013, pp. 111-118: IEEE.
    [13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, "Discovering frequent closed itemsets for association rules," in International Conference on Database Theory, 1999, pp. 398-416: Springer.
    [14] IBM. Intelligent information systems. Internet page. Available: http://www.almaden.ibm.com/software/quest/resources/.

    QR CODE