簡易檢索 / 詳目顯示

研究生: 黃金評
Chin-Ping Huang
論文名稱: 應用Locality Sensitive Hashing以改善關聯規則探勘之效率
Locality Sensitive Hashing for Sampling-based Algorithms in Association Rule Mining
指導教授: 陳秋華
Chyou-Hwa Chen
口試委員: 鄧惟中
none
邱舉明
none
戴碧如
none
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2009
畢業學年度: 97
語文別: 中文
論文頁數: 45
中文關鍵詞: 資料探勘關聯規則LSH
外文關鍵詞: Data mining, Association rule, LSH
相關次數: 點閱:349下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報

關聯規則的挖掘是目前最重要的技術之一,應用於智慧型系統的設計並己被廣泛的施行於大量的實際應用中。然而,當資料庫越來越龐大,傳統的演算法無法在可接受的時間內完成,利用抽樣的方式來處理資料庫的子集是一種可行的方法之一,但是只使用資料庫的子集因為抽樣概率的屬性與原始的資料庫仍有差距存在無法達到很精準的規則抽取,改善準確性(Accuracy)方法之一是利用從初始的樣品中移除離群值(outlier)以獲得較相似於原始的龐大資料庫的精簡的資料集來代表。
在本篇論文中, 我們針對初步的抽樣資料做分群(clustering),再於各分群的局部資料中執行離群值(outlier)的移除,如此離群值的移除是基於各獨立的分群中而不是在整個抽樣的資料集中,然而針對高維度交易資料做分群本身即是很困難的問題,為解決這個問題,我們利用LSH來做切割初始抽樣資料成多個重疊分群集合,提出LSH-Basic/LSH-Trim機制來改善效率。經由各種實驗數據發現,我們提出的方法可在準確度或執行時間表現出比之前的演算法更好的效率。


Association rule mining is one of the most important techniques for intelligent system design and has been widely applied in a vast number of real applications. However, when the database is too large, classical algorithms cannot reasonably complete in time. The sampling approach that processes a subset of the database is a viable alternative. However, using only a subset of the database makes it impossible to achieve perfectly accurate rule extraction because the statistical property of the sample is most certainly different from that of the original database. One approach to improve the accuracy is to remove “outliers” from an initial simple random sample to obtain a final sample whose statistical property is more similar to that of the original database.
In this paper, we propose to apply data clustering techniques to the initial sample before outlier removal is performed on the resulting clusters, so that outliers are removed based on the statistical properties in individual clusters, instead of the initial sample as a whole. However, clustering transactional data with very high dimensions is a difficult problem by itself. We solve this problem by employing locality sensitive hashing (LSH) as a means for data clustering. LSH is used to divide the initial sample into multiple possibly overlapping set of clusters. we evaluate several algorithms based on this strategy, and find that our proposals exhibit better accuracy or execution time, or both, than previous algorithms

Table of Contents 摘要I ABSTRACTII 誌謝III TABLE OF CONTENTSIV LIST OF FIGURESVI 1INTRODUCTION1 1.1MOTIVATION1 2RELATED WORKS4 2.1ASSOCIATION RULES DEFINITION4 2.2APRIORI ALGORITHM5 2.3SAMPLING BASED MINING ALGORITHM8 2.4TWO PHASE SAMPLING BASED ALGORITHM9 2.5GENERAL DISTANCE BASED OUTLIER DETECTION10 2.6HIGH DIMENSIONAL DATA OBJECTS SIMILARITY SEARCH USING LSH11 3LSH-BASED SAMPLING ALGORITHM FOR ARM13 3.1LSH-BASIC ALGORITHM15 3.2LSH-TRIM ALGORITHM17 4PERFORMANCE EVALUATION22 4.1SIMULATION ENVIRONMENT22 4.2ACCURACY PERFORMANCE OF THE PROPOSED STRATEGIES23 4.3IMPACT OF OVERLAPPING CLUSTERING SCHEME (|L|=1, 5, 10, 15)28 4.4IMPACT OF THE NUMBER OF CLUSTERS31 4.5IMPACT OF THRESHOLD-BASED FILTERING32 4.6IMPACT OF DATABASE SIZE34 4.7IMPACT OF THE DATA DIMENSIONALITY34 5DISCUSSIONS AND CONCLUSIONS36 5.1CONCLUSION36 5.2FUTURE WORKS36 REFERENCES37

1.C.C. Aggarwal and S.Y. Philip, An effective and efficient algorithm for high-dimensional outlier detection, The VLDB Journal, 14 (2005), pp. 211–221
2.R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the Conference on VLDB, 1994, pp. 487–499
3.Huseyin Akcan , Alex Astashyn , Herve Bronnimann, “Deterministic algorithms for sampling count data,” Data & Knowledge Engineering, v.64 n.2, p.405-418, February, 2008
4.H. Bronnimann, B. Chen, M. Dash, P.J. Haas and P. Scheuermann, “Efficient data reduction with EASE,” Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003
5.Bin Chen , Peter Haas , Peter Scheuermann, A new two-phase sampling based algorithm for discovering association rules, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada
6.Yen-Liang Chen , Lucas Tzu-Hsuan Hung, Using decision trees to summarize associative classification rules, Expert Systems with Applications: An International Journal, v.36 n.2, p.2338-2351, March, 2009
7.Duke Hyun Choi, Byeong Seok Ahn, Soung Hie Kim, Prioritization of association rules in data mining: Multiple criteria decision approach, Expert Systems with Applications, Volume 29, Issue 4, November 2005, Pages 867-878
8.Davies, D.L., Bouldin, D.W. (1979), A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, 224-227.
9.Amol Ghoting, Srinivasan Parthasarathy and Matthew Eric Otey, Fast mining of distance-based outliers in high-dimensional datasets, Data Mining and Knowledge Discovery, Volume 16, Number 3, 2008
10.Guha, R. Rastogi and K. Shim, ROCK:A robust clustering algorithm for categorical attributes, Information Systems 25 (2000), pp. 345–366
11.J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the Conference ACMSIGMOD, 2000, pp. 1–12
12.Yu Xiao Hua, Bin Feng Dan, Bo Yu, Bounded LSH for Similarity Search in Peer-to-Peer File Systems, ICPP '08. 37th International Conference on Parallel Processing, 2008.
13.Kuo, R.J., Lin, S.Y., and Shih, C.W., “Mining association rules through integration of clustering analysis and ant colony system for health insurance database in Taiwan,” Expert Systems with Applications, 33(3), pp.794-808, November 2007
14.Duen-Ren Liu , Meng-Jung Shih , Churn-Jung Liau , Chin-Hui Lai, Mining the change of event trends for decision support in environmental scanning, Expert Systems with Applications, v.36 n.2, p.972-984, March, 2009
15.Srinivasan Parthasarathy, Efficient Progressive Sampling for Association Rules, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), p.354, December 09-12, 2002
16.P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining" by Addison Wesley, 2006
17.Toivonen, H. (1996), Sampling large databases for association rules, in The VLDB Journa', pp. 134-145.
18.R. Weber, H.J. Schek, and S. Blott, Quantitative Analysis and Performance Study for Similarity-. Search Methods in High-Dimensional Space, Proceedings of the 24th International Conference on Very Large Data Bases, 1998
19.Mao Ye , Xue Li , Maria E. Orlowska, Projected outlier detection in high-dimensional mixed-attributes data set, Expert Systems with Applications, v.36 n.3, p.7104-7113, April, 2009
20.J. Zaki, S. Parthasarathy, W. Lin, and M. Ogihara. Evaluation of sampling for data mining of association rules. Proceedings Seventh International Workshop on Research Issues in Data Engineering, 1997.
21.IBM almaden-quest data mining synthetic data generation code. <http://www.almaden.ibm.com/software/quest/>

無法下載圖示
全文公開日期 本全文未授權公開 (校外網路)
全文公開日期 本全文未授權公開 (國家圖書館:臺灣博碩士論文系統)
QR CODE