研究生: |
王東祈 Dung-Chi Wang |
---|---|
論文名稱: |
在資料串流中尋找頻繁項目集之研究 A Research on Mining Frequent Itemsets in Data Stream |
指導教授: |
呂永和
Yung-Ho Lu |
口試委員: |
葉耀明
Yao-Min Yeh 楊維寧 Wei-Ning Yang |
學位類別: |
碩士 Master |
系所名稱: |
管理學院 - 資訊管理系 Department of Information Management |
論文出版年: | 2007 |
畢業學年度: | 95 |
語文別: | 中文 |
論文頁數: | 58 |
中文關鍵詞: | 資料串流探勘 、頻繁項目集 、漸增式更新 |
外文關鍵詞: | Data Stream Mining, Frequent Itemsets, Incremental Updates |
相關次數: | 點閱:203 下載:11 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
在現今資訊爆炸的社會,傳統的資料庫與資料探勘方式已經無法達到使用者的要求,故產生了資料串流的概念。資料串流擁有資料異動頻繁、資料流入量龐大與使用者需要快速回應等特性。對此本論文提出了一個適用於資料串流的資料探勘演算法,主要是將原始資料轉換成可以儲存於記憶體的結構,而當資料流動之後我們僅需對儲存於記憶體上的資料結構做局部的更新動作,將流出的資料刪除,將流入的資料加入,而資料探勘方法可以直接在主記憶體上執行而無須對資料庫做重複的掃瞄。經過實驗證明本論文所提出的演算法的確優於目前常見的傳統演算法。
Data mining in data stream is a popular research topic in recent years. The traditional data mining algorithms for static databases cannot be efficiently used in data stream. Data stream is characterized by its high data update frequency, unlimited data size, and real-time response requirement. In this thesis, we propose a novel algorithm, called FLDS (Frequent Lists in Data Stream). FLDS stores a Count_List and a Count_Array data structures in the main-memory. The Count_List stores the current transactions’ ids (TID list) for each data item and the Count_Array stores the transaction counts for all the 1-itemsets and the 2-itemsets. When the window slides, we only need to update the two in-memory data structures by removing the transaction ids of the transactions and the transaction counts contributed by the transactions of the expired bucket and inserting the transaction ids of the transactions and the transaction counts contributed by the transactions of the incoming bucket. The searching for frequent itemsets is then performed on the in-memory data structures. By incremental updating the in-memory data structures, the FLDS algorithm avoids multiple scans of the data stream. The experiments show that the searching for frequent itemsets by using the frequent lists is very efficient. The experiments show that the FLDS algorithm is much faster than the traditional data mining algorithms for data stream mining.
[1] R.Agawal,T. Imielinski and A. Swami.“Mining Association Rules between Sets of Items in Large Databases,”Proc. Of ACM SIGMOD,pages 207-216,May 1993.
[2] G. Manku and R. Motwani,“Approximate Frequency Counts over Data Streams,”Proc. Of VLDB Conf,. pp.346-357,2002
[3] R.Agrawal and R. Srikant(1995),“Mining sequential patterns,”Proceeding of The International Conference of Data Engineering ,3-14.
[4] J.H.Chang and W.S.Lee,“Finding Recent Frequent Itemsets Adaptively over Online Data Stream,“Proc. Of ACM SIGKDD Conf., pp.487-492 2003.
[5] E.Cohen and M.Strauss,“Maintaining Time Decaying Stream Aggregates,”Proc. Of ACM PODS Symp.,2003.
[6] R.Agrawal and R.Srikant,“Fast Algorithms for Mining Association Rules,”Proc. Of VLDB Journal, 12(2): 120-139, 2003.
[7] A.Arasu snd G.S.Manku,“Approximate Counts and Quantiles over Sliding Windows,”Proc. Of ACM PODS Symp.,2004.
[8] D.Cheung,J.han,V.Ng, and C.Y.Wong,“Maintenance of Discovered Association Rules in Large Database: An Incremental Updating Technique,”Proc. Of ICDE Conf.,1996
[9] G.Cormode and S.Muthukrishnaan,“What’s Hot and What’s not: Tracking Most Frequent items Dynamically,”Proc. Of ACM PODS Symp., pp.296-306,2003
[10] C.Giannella, J.Han, J.Pei, X.Yan, and P.S.Yu,“Mining Frequent Patterns in Data Streams at Multiple Time Granularities,”H. Kargupta, A.Joshi, K.Sivakumar, and Y. Yesha(eds.), Next Generation Data Mining, pp.191-212,2003
[11] J.Han and K.Kamber, Simon Fraser University(2001), Data Mining:Concepts and Techniques,5nd ed., San Francisco: Morgan Kaufmann, pp.230-235
[12] A. Das, W. K. Ng and Y. K. Woon, “Association Rule Mining:Rapid Association Rule Mining,”Proceedings of the tenth international conference on Information and Knowledge Management October 2000
[13] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia, M. Bawa and D. Shah,“Turbo-Charging Vertical Mining of Large Databases,”ACM SIGMOD Record,Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data,May 2000
[14] Yudho Giri Sucahyo, Raj P. Gopalan,“CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth,” ACM International Conference Proceeding Series; Vol. 143, Proceedings of the 14th Australasian database conference - Volume 17, pp.95-104, 2003.
[15] 邱士軍(2002),“關聯規則演算法之實作與效能評估”,國立台灣科技大學資訊管理研究所碩士論文。
[16] 王守田(2004),“一個使用高頻串列具有可擴充性的關連規則演算法”,國立台灣科技大學資訊管理研究所碩士論文。
[17] 宋子康(2006),“資料串流中尋找關聯法則之研究”,國立台灣科技大學資訊管理研究所碩士論文。
[18] Yunyue Zhu, Dennis Shasha;StatStream:Statistical Monitoring of Thousands of Data Streams in Real Time;Int’l Conf. on Very Large Data Bases; 2002
[19] Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan; An Efficient Algorithm for Mining Frequent Itemsets over the Entrie History of Data Stream; Int’l Workshop on Knowledge Discovery in Data Stream: Sept. 2004
[20] Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou; False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Stream; Int’l Conf. on Very Large Databases; 2004
[21] Joong Hyuk Chang, Won Suk Lee;Finding Recently Ferquent Itemsets Adaptively over Online Data Streams; ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining; August 2003
[22] Joong Hyuk Chang, Won Suk Lee; A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Stream; Journal of Information Science and Engineering; July 2004
[23] Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen; Mining Frequent Itemsets from Data Streams with a Time-Sensitive Slidind Window; SIAM Int’l Conf. on Data Stream; April 2005