簡易檢索 / 詳目顯示

研究生: 王東祈
Dung-Chi Wang
論文名稱: 在資料串流中尋找頻繁項目集之研究
A Research on Mining Frequent Itemsets in Data Stream
指導教授: 呂永和
Yung-Ho Lu
口試委員: 葉耀明
Yao-Min Yeh
楊維寧
Wei-Ning Yang
學位類別: 碩士
Master
系所名稱: 管理學院 - 資訊管理系
Department of Information Management
論文出版年: 2007
畢業學年度: 95
語文別: 中文
論文頁數: 58
中文關鍵詞: 資料串流探勘頻繁項目集漸增式更新
外文關鍵詞: Data Stream Mining, Frequent Itemsets, Incremental Updates
相關次數: 點閱:203下載:11
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報

在現今資訊爆炸的社會,傳統的資料庫與資料探勘方式已經無法達到使用者的要求,故產生了資料串流的概念。資料串流擁有資料異動頻繁、資料流入量龐大與使用者需要快速回應等特性。對此本論文提出了一個適用於資料串流的資料探勘演算法,主要是將原始資料轉換成可以儲存於記憶體的結構,而當資料流動之後我們僅需對儲存於記憶體上的資料結構做局部的更新動作,將流出的資料刪除,將流入的資料加入,而資料探勘方法可以直接在主記憶體上執行而無須對資料庫做重複的掃瞄。經過實驗證明本論文所提出的演算法的確優於目前常見的傳統演算法。


Data mining in data stream is a popular research topic in recent years. The traditional data mining algorithms for static databases cannot be efficiently used in data stream. Data stream is characterized by its high data update frequency, unlimited data size, and real-time response requirement. In this thesis, we propose a novel algorithm, called FLDS (Frequent Lists in Data Stream). FLDS stores a Count_List and a Count_Array data structures in the main-memory. The Count_List stores the current transactions’ ids (TID list) for each data item and the Count_Array stores the transaction counts for all the 1-itemsets and the 2-itemsets. When the window slides, we only need to update the two in-memory data structures by removing the transaction ids of the transactions and the transaction counts contributed by the transactions of the expired bucket and inserting the transaction ids of the transactions and the transaction counts contributed by the transactions of the incoming bucket. The searching for frequent itemsets is then performed on the in-memory data structures. By incremental updating the in-memory data structures, the FLDS algorithm avoids multiple scans of the data stream. The experiments show that the searching for frequent itemsets by using the frequent lists is very efficient. The experiments show that the FLDS algorithm is much faster than the traditional data mining algorithms for data stream mining.

目錄 I 圖表目錄 V 第一章 序論 1 1.1研究背景 1 1.2資料探勘的分類 3 1.3研究動機 4 1.4論文架構 5 第二章 相關研究 6 2.1 關聯規則 6 2.2資料串流與資料探勘 8 2.2.1 資料串流的模型 9 2.2.2 記憶體的管理 11 2.2.3 資料串流中的資料探勘 12 2.3 Apriori演算法 12 2.3.1 Apriori演算法運作 12 2.3.2 Apriori演算法之範例 15 2.3.3 Apriori演算法之缺失 17 2.4 FP-Growth演算法探討 17 2.4.1 FP-tree的建構方式 17 2.4.2 FP-Growth演算法 22 2.3.3 FP-Growth演算法之缺失 23 2.4 Frequent List 演算法 23 第三章 研究方法 26 3.1 前置處理 27 3.1.1 Count_List 28 3.1.2 Count_Array 31 3.2 研究方法 34 3.2.1 演算法步驟 34 3.2.2 演算法範例 37 第四章 效能分析與實驗結果 42 4.1實驗環境建置 42 4.1.1軟硬體環境 42 4.1.2測試資料庫 43 4.2 FLDS演算法與傳統資料探勘演算法之比較 45 4.3 FLDS之效能分析評估 48 4.3.1不同項目總數下的評估 48 4.3.2不同交易平均長度下的評估 49 4.3.3在不同框架大小下的評估 50 第五章 結論與未來展望 52 參考文獻 54

[1] R.Agawal,T. Imielinski and A. Swami.“Mining Association  Rules between Sets of Items in Large Databases,”Proc. Of ACM SIGMOD,pages 207-216,May 1993.
[2] G. Manku and R. Motwani,“Approximate Frequency Counts over Data Streams,”Proc. Of VLDB Conf,. pp.346-357,2002
[3] R.Agrawal and R. Srikant(1995),“Mining sequential patterns,”Proceeding of The International Conference of Data Engineering ,3-14.
[4] J.H.Chang and W.S.Lee,“Finding Recent Frequent Itemsets Adaptively over Online Data Stream,“Proc. Of ACM SIGKDD Conf., pp.487-492 2003.
[5] E.Cohen and M.Strauss,“Maintaining Time Decaying Stream Aggregates,”Proc. Of ACM PODS Symp.,2003.
[6] R.Agrawal and R.Srikant,“Fast Algorithms for Mining Association Rules,”Proc. Of VLDB Journal, 12(2): 120-139, 2003.
[7] A.Arasu snd G.S.Manku,“Approximate Counts and Quantiles over Sliding Windows,”Proc. Of ACM PODS Symp.,2004.
[8] D.Cheung,J.han,V.Ng, and C.Y.Wong,“Maintenance of Discovered Association Rules in Large Database: An Incremental Updating Technique,”Proc. Of ICDE Conf.,1996
[9] G.Cormode and S.Muthukrishnaan,“What’s Hot and What’s not: Tracking Most Frequent items Dynamically,”Proc. Of ACM PODS Symp., pp.296-306,2003
[10] C.Giannella, J.Han, J.Pei, X.Yan, and P.S.Yu,“Mining Frequent Patterns in Data Streams at Multiple Time Granularities,”H. Kargupta, A.Joshi, K.Sivakumar, and Y. Yesha(eds.), Next Generation Data Mining, pp.191-212,2003
[11] J.Han and K.Kamber, Simon Fraser University(2001), Data Mining:Concepts and Techniques,5nd ed., San Francisco: Morgan Kaufmann, pp.230-235
[12] A. Das, W. K. Ng and Y. K. Woon, “Association Rule Mining:Rapid Association Rule Mining,”Proceedings of the tenth international conference on Information and Knowledge Management October 2000
[13] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia, M. Bawa and D. Shah,“Turbo-Charging Vertical Mining of Large Databases,”ACM SIGMOD Record,Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data,May 2000
[14] Yudho Giri Sucahyo, Raj P. Gopalan,“CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth,” ACM International Conference Proceeding Series; Vol. 143, Proceedings of the 14th Australasian database conference - Volume 17, pp.95-104, 2003.
[15] 邱士軍(2002),“關聯規則演算法之實作與效能評估”,國立台灣科技大學資訊管理研究所碩士論文。
[16] 王守田(2004),“一個使用高頻串列具有可擴充性的關連規則演算法”,國立台灣科技大學資訊管理研究所碩士論文。
[17] 宋子康(2006),“資料串流中尋找關聯法則之研究”,國立台灣科技大學資訊管理研究所碩士論文。
[18] Yunyue Zhu, Dennis Shasha;StatStream:Statistical Monitoring of Thousands of Data Streams in Real Time;Int’l Conf. on Very Large Data Bases; 2002
[19] Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan; An Efficient Algorithm for Mining Frequent Itemsets over the Entrie History of Data Stream; Int’l Workshop on Knowledge Discovery in Data Stream: Sept. 2004
[20] Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou; False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Stream; Int’l Conf. on Very Large Databases; 2004
[21] Joong Hyuk Chang, Won Suk Lee;Finding Recently Ferquent Itemsets Adaptively over Online Data Streams; ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining; August 2003
[22] Joong Hyuk Chang, Won Suk Lee; A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Stream; Journal of Information Science and Engineering; July 2004
[23] Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen; Mining Frequent Itemsets from Data Streams with a Time-Sensitive Slidind Window; SIAM Int’l Conf. on Data Stream; April 2005

QR CODE