簡易檢索 / 詳目顯示

研究生: 呂信賢
Hsin-Hsien Lu
論文名稱: P2P網路上混合式預測大項目集探勘的研究
A Hybrid Predictive Maximal Frequent Itemset Mining Algorithm On P2P Network
指導教授: 陳秋華
Chyouhwa Chen
鄧惟中
Wei-Chung Teng
口試委員: 項天瑞
Hsin-Hsien Lu
李育杰
Yuh-Jye Lee
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2012
畢業學年度: 100
語文別: 中文
論文頁數: 58
中文關鍵詞: maximal frequent itemsetpredictiveP2P Networks
外文關鍵詞: Peer to peer networks, maximal frequent itemset
相關次數: 點閱:335下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 文獻中有較多探討是針對並行式最大頻繁項目集探勘的問題,但是很少探討對於分散式最大頻繁項目集探勘的問題。
    本論文提出了一個在分散式P2P網路的基礎上,混合利用Bottom-Up的Apriori-like與Top-Down在過程中做最大頻繁項目集探勘的一個新的演算法。利用頻繁項目集的集合本身作為預測的基礎。該演算法將由Bottom-Up的Apriori-like找到之頻繁項目集視為資料庫,計算支持度來尋找長度最長的最大頻繁項目集,保持目前已知的最大的頻繁項目集,這是極其寶貴的,可以減少在分佈式環境中的Candidates的消息交換。為了最大限度地提高演算法的效能,該演算法在第一階段,提出一個簡單有效的預測方法,幫助在Bottom-Up過程中減少Candidates。
    本論文以模擬器為實驗的基礎,分析比較在分散式系統中求取最大頻繁項目集的方法。經由一系列廣泛的實驗結果表明,我們所提出的策略與演算法是有效的。


    Abstract—We propose a novel algorithm for distributed maximal frequent itemset mining in unstructured peer-to-peer networks based on a bottom-up Apriori-like process. The algorithm exploits both the subset infrequency pruning strategy and superset frequency pruning strategy, by maintaining currently known maximal FI and minimal NFI sets, which are extremely valuable to reduce candidate message exchanges in a distributed environment. To maximize the benefits of the two pruning strategies, the algorithm comprises of two stages. In the first stage, a preliminary set of maximal FIs and a set of minimal NFIs are discovered. The results are used and augmented in the second search procedure to find maximal frequent itemsets. In each stage, a novel predictive candidate generation strategy is embedded in the search to generate long frequent candidate itemsets. Unlike Apriori, which generates candidates of length l +1 from frequent itemsets of length l, our predictive candidate generation strategy tries to generate candidates of length l +k, with k≧1. Correctly predicted long itemsets are used for superset frequency pruning. However, incorrectly predicted itemsets incur wasted verification messages. An extensive set of experiments show that our proposed strategy is effective in candidate prediction, thus requiring significantly less message exchange than traditional approaches.

    誌謝 摘要 Abstract 目錄 圖表目錄 1 Introduction ……………………………………………………………………1 1.1 前言…………………………………………………1 1.2 研究動機……………………………………………4 1.3 分散式網路架構……………………………………10 1.4 現行方法……………………………………………10 1.4.1 集中式系統…………………………11 1.4.2 分散式系統…………………………11 2 相關議題…………………………………………………………………………12 3 演算法……………………………………………………………………………13 3.1 想法…………………………………………………13 3.2 預測Candidates……………………………………15 3.3 演算法架構…………………………………………18 4 演算法範例………………………………………………………………………28 5 實驗效能…………………………………………………………………………41 5.1 環境參數……………………………………………41 5.2 實驗效能評估………………………………………41 6 結語………………………………………………………………………………53 7 未來展望…………………………………………………………………………54 8 參考文獻…………………………………………………………………………55

    1. R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962 . 969, 1996
    2. R. Agrawal, and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Proceedings of the 20th International Conference on Very Large Data Bases, (VLDB'94), pp. 487-499, Sep. 1994
    3. Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l. Conference on Management of Data, pages 207.216, Washington, D.C., June 1993
    4. R. J. Bayardo Jr. Efficiently mining long patterns from databases. In SIGMOD, 1998.
    5. Y. Birk, L. Liss, A. Schuster, and R. Wolff. A local algorithm for ad hoc majority voting via charge fusion. In Proc. of DISC, October 2004
    6. B. Bloom. Space/time tradeoffs in in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970
    7. Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium, 4168, pp. 684–695
    8. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In ACM SIGMOD, 1997
    9. S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE Trans. Info. Theory, vol. 52, no. 6, 2006
    10. Toon Calders, Bart Goethals,.Non-derivable itemset mining.- In: Data mining and knowledge discovery, 14:1(2007), p. 171-206
    11. Toon Calders, Bart Goethals, Depth-first non-derivable itemset mining.- In: Proceedings of the SIAM International Conference on Data Mining, s.l., , 2005, p. 250-261
    12. Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. Making Gnutella-like P2P Systems Scalable. In SIGCOMM, pages 407–418, 2003
    13. D. Cheung and Y. Xiao. Effect of data skewness in parallel mining of association rules. In 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 48 . 60, April 1998.
    14. D.W. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In Proc. of the Int'l. Conf. on 44, Miami Beach, Florida, December 1996
    15. Laukik Chitnis, Alin Dobra, Sanjay Ranka: Fault tolerant aggregation in heterogeneous sensor networks, Journal of Parallel and Distributed Computing 69(2):(Feb 2009) 210--219, 10
    16. Laukik Chitnis, Alin Dobra, Sanjay Ranka: Aggregation methods for large-scale sensor networks. IEEE Transactions of Sensor Networks 4(2):(2008)
    17. Laukik Chitnis, Alin Dobra, Sanjay Ranka. Analyzing the multiple aggregation trees technique for fault tolerance in sensor networks. In proceedings of International Conference on Information Systems, Technology and Management (ICISTM 2007), New Delhi, India, March 2007. pg. 269-279.
    18. Soon Myoung Chung, Congnan Luo, Efficient mining of maximal frequent itemsets from databases on a cluster of workstations. Knowledge and Information Systems, 16(3):359–391, (2008)
    19. Demers, A.J., Greene, D.H., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H.E., Swinehart, D.C., Terry, D.B.: Epidemic algorithms for replicated database maintenance. In: Proceedings of the Principles of Distibuted Computing (PODC), pp. 1–12 (1987)
    20. L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable widearea Web cache sharing protocol. IEEE/ACM Trans. on Networking, 8(3):281-293, 2000
    21. Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3):352 . 377, 2000
    22. J. Han, and M. Kamber, Data Mining Concepts and Techniques, San Francisco, Morgan Kaufmann, 2001
    23. J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1-12, May 2000
    24. Gian Paolo Jesi, Alberto Montresor, and Ozalp Babaoglu, Proximity-aware superpeer overlay topologies, IEEE Transactions on Network and Service Management, 4(2):74-83, September 2007
    25. Lujun Jia, Guevara Noubir, Rajmohan Rajaraman, Ravi Sundaram, Group-Independent Spanning Trees for Data Aggregation in Sensor Networks, IEEE International Conference on Distributed Computing in Sensor Systems, June 2006
    26. Michael Kleis, Eng Keong Lua, Hierarchical Peer-to-Peer Networks Using Lightweight SuperPeer Topologies, 10th IEEE Symposium on Computers and Communications (IEEE ISCC'05), Cartagena, Spain, June 27-30, 2005, pp. 143-148
    27. Denis Krivitski, Assaf Schuster, Ran Wolff, “A Local Facility Location Algorithm for Large-scale Distributed Systems,” Journal of Grid Computing, 5(4): 361-378, 2007
    28. P Luo, H Xiong, K Lü, Z Shi, “Distributed classification in peer-to-peer networks,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007
    29. A. Montresor, “A robust protocol for building superpeer overlay topologies,” in Proc. of the 4th Int. Conf. on Peer-to-Peer Computing. Zurich, Switzerland: IEEE, Aug. 2004
    30. Osmar R. Zaiane, Mohammad El-Hajj, and Paul Lu. Fast parallel association rules mining without candidacy generation. In IEEE 2001 International Conference on Data Mining, pages 665.668, 2001
    31. A. Schuster and R. Wolff. Communication-efficient distributed mining of association rules. In Proc. of the ACM SIGMOD Int'l. Conference on Management of Data, pages 473 . 484, Santa Barbara, California, May 2001.
    32. Ran Wolff and Assaf Schuster, “Association Rule Mining in Peer-to-Peer Systems,” IEEE Transactions on Systems, Man, and Cybernetics, part B, 34(6), 2004
    33. B. Yang and H. Garcia-Molina, “Designing a super-peer network,” in Proc. IEEE International Conference on Data Engineering (ICDE’03),2003
    34. Li Xiao, Zhenyun Zhuang, and Yunhao Liu, "Dynamic Layer Management in Superpeer Architectures", IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 11, November 2005, Pages 1078-1091
    35. J. Xu Yu, Z. Chong, H. J. Lu and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets form High Speed Transactional Data Streams. In Proc. of 30th Intl. Conf. on Very Large Data Bases, pages 204-215, 2004
    36. G. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, Massachusetts, 1949.
    37. IBM Quest Synthetic Data Generator.(2010) – IBM quest data mining synthetic data generation code.http://ibmquestdatagen.sourceforge.net/
    38. http://www.almaden.ibm.com/software/quest/Resources/datasets/synd
    ata.html
    39. http://fimi.cs.helsinki.fi/data/
    40. http://www.ics.uci.edu/~mlearn

    無法下載圖示 全文公開日期 2017/07/31 (校內網路)
    全文公開日期 本全文未授權公開 (校外網路)
    全文公開日期 本全文未授權公開 (國家圖書館:臺灣博碩士論文系統)
    QR CODE