Basic Search / Detailed Display

Author: 林楊澤
Yang-Tze Lin
Thesis Title: 在決策樹模型下利用交換技術隱藏準標識符達到保護隱私的效果
Hiding Quasi-identifier on Decision Tree utilizing Swapping Technique for Preserving Privacy
Advisor: 戴碧如
Bi-Ru Dai
Committee: 楊得年
De-Nian Yang
吳怡樂
Yi-Leh Wu
陳建錦
Chien-Chin Chen
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2009
Graduation Academic Year: 97
Language: 英文
Pages: 56
Keywords (in Chinese): 分類隱私的保護決策樹標準識符
Keywords (in other languages): Classification, preserving privacy, decision tree, quasi-identifier
Reference times: Clicks: 216Downloads: 0
Share:
School Collection Retrieve National Library Collection Retrieve Error Report
  • 分類技術在探勘領域是一種非常重要的議題,並且決策樹在分類技術上更是一種常被應用的方法。有些資料來源包含了個人的隱私資料,並且這些個人的隱私資料並不是人們願意被公開的。當個人的隱私資料被公開或者被利用,這種情況可能危及到人們的隱私權或者其他的權益,所以應該在資料被釋放出之前就必須採取相對應的保護措施。然而,現存的技術在隱藏個人的隱私資料時是直接修改原始的資料內容,這樣的技術雖然可以有效的保護個人的隱私資料,但是在修改過程中並沒有考慮分類模型的預測正確度,這會使得修改後的資料無法被有效地利用在分類技術上。在我們的研究當中,我們提出了一個演算法保護個人的隱私資料,並且在隱藏的過程當中考慮決策樹的預測正確性。我們的目標是透過我們所提出的演算法保護所有的個人的隱私資料,並且能夠有效地降低資料的破壞程度。此外在考慮如何保護個人的隱私資料的同時,我們也會有效地維持決策樹的預測正確度。在實驗中,我們會驗證我們所提的演算法是可以成功地隱藏所有的個人隱私資料和維持決策樹的預測正確度。


    Classification is an important issue in data mining, and decision tree is one of the most popular techniques for classification analysis. Some data sources contain private personal information that people are unwilling to reveal. The disclosure of person-specific data is possible to endanger thousands of people, and therefore the dataset should be protected before it is released for mining. However, techniques to hide private information usually modify the original dataset without considering influences on the prediction accuracy of a classification model. In this research, we propose an algorithm to protect personal privacy for classification model based on decision tree. Our goal is to hide all person-specific information with minimized data perturbation. Furthermore, the prediction capability of the decision tree classifier can be maintained. As demonstrated in the experiments, the proposed algorithm can successfully hide private information with fewer disturbances of the classifier.

    指導教授推薦書 II 論文口試委員審定書 III 摘 要 IV ABSTRACT V LIST OF TABLES VIII LIST OF FIGURES IX CHAPTER 1. INTRODUCTION 1 1.1 CONTRIBUTION 4 1.2 THE ORGANIZATION OF THIS THESIS 5 CHAPTER 2. RELATED WORK 6 2.1 DATA MINING 6 2.2 PRIVACY 7 2.2.1 Data Perturbation 7 2.2.2 Generalization 9 2.2.3 Data Swapping 9 2.2.4 Horizontally and Vertically Partition 9 CHAPTER 3. PROBLEM STATEMENT 11 3.1 STATEMENT 1 (QUASI-IDENTIFIER) 11 3.2 STATEMENT 2 (SUCCESSFUL HIDING) 12 3.3 STATEMENT 3 (PERTURBATION OF THE DECISION TREE) 13 CHAPTER 4. THE FRAMEWORK OF EFFICIENT PROTECTION WITH LESS PERTURBATION (EPLP) 14 4.1 DISCRETIZATION FUNCTION 15 4.2 SIMILAR ATTRIBUTE COMBINATION AROUND THE UNIQUE TUPLE (SACUT) 19 4.3 THE EFFICIENT PROTECTION WITH LESS PERTURBATION (EPLP) ALGORITHM 23 4.3.1 Hiding All Quasi-identifiers (EPLP-HAQI) Algorithm 26 4.3.2 One Perturbation with More Hiding (EPLP-OPMH) algorithm 30 4.3.3 Efficiently Maintaining Prediction Accuracy (EPLP-EMPA) algorithm 34 4.3.2 The Integrated EPLP Algorithm 35 CHAPTER 5. EXPERIMENTAL RESULTS 39 5.1 DATASET 40 5.2 PERFORMANCE EVALUATION 41 5.2.1 Comparing the features of algorithms EPLP-HAQI, and EPLP-OPMH and EPLP-EMPA 42 5.2.2 Comparing performance between the integrated EPLP and other algorithms 44 CHAPTER 6. CONCLUSION 52 REFERENCES 53

    [1] M.-S. Chen, J. Han, P.S. Yu, “Data mining: An overview from a database perspective,” In IEEE Transactions on Knowledge and Data Engineering, 8 (6), 1996, pp. 866-883.
    [2] J. Li, R. C.-W. Wong, Ada W.-C Fu and J. Pei, “Achieving k- Anonymity by Clustering in Attribute Hierarchical Structures”, in proceedings of the Data Warehousing and Knowledge Discovery, 2006, 405-416.
    [3] J. Byun, A. Kamra, E. Bertino, N. Li, “Efficient k-Anonymization using clustering techniques”. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, 2007, pp. 188–200. Springer, Heidelberg.
    [4] B.C.M. Fung, K. Wang, P.S. Yu, “Anonymizing classification data for privacy preservation”. in IEEE Transactions on Knowledge and Data Engineering, 19 (5), 2007, pp. 711-725.
    [5] J.R. Quinlan, “Induction of decision trees”. Machine Learning 1, 1986, 81-106.
    [6] Xiao, M., Huang, L., Shen, H., Luo, Y, “Privacy preserving ID3 algorithm over horizontally partitioned data”(2005) Parallel and Distributed Computing, Applications and Technologies, PDCAT Proceedings, 2005, art. no. 1578905, pp. 239-243.
    [7] J. Vaidya and C. Clifton. “Privacy-Preserving Decision Trees over Vertically Partitioned Data,” in Data and Application Security (DBSec), pages 139–152, 2005.
    [8] J. Byun, A. Kamra, E. Bertino, N. Li, “Efficient k-Anonymization using clustering techniques”. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, 2007, pp. 188–200. Springer, Heidelberg.
    [9] C. C. Aggarwal. “On k-anonymity and the curse of dimensionality,” in VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 901–909. VLDB Endowment, 2005.
    [10] D. Agrawal and C. C. Aggarwal. “On the design and quantification of privacy preserving data mining algorithms”. In PODS ’01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 247–255, New York, NY, USA, 2001. ACM Press.
    [11] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2006.
    [10] R. Agrawal, T. Imielinski, and A. N. Swami, “Mining Association Rules between Sets of Items in Large Databases,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (P. Buneman and S. Jajodia, eds.), vol. 22, (Washington, D.C.), pp. 207–216, ACM, May 1993.
    [11] J. A. Hartigan. “Clustering Algorithms”, John Wiley & Sons, 1975.
    [12] A. K. Jain and R. C. Dubes. “Algorithms for Clustering Data”, Prentice Hall, 1988.
    [13] S. P. Lloyd. Least Squares Quantization in PCM. IEEE Trans. Information Theory, 28:128-137, 1982.
    [14] J. MacQueen. “Some methods for classification and analysis of multivariate observations”, Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.
    [15] Liu, K., Kargupta, H., Ryan, J.Random, “Projection-based multiplicative data perturbation for privacy preserving distributed data mining”, (2006) IEEE Transactions on Knowledge and Data Engineering, 18 (1), pp. 92-106. Cited 35 times.doi: 10.1109/TKDE.2006.14
    [16] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty, Fuzziness, and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.
    [17] C.K. Liew, U.J. Choi, and C.J. Liew, “A Data Distortion by Probability Distribution,” ACM Trans. Database Systems (TODS), vol. 10, no. 3, pp. 395-411, 1985.
    [18] E. Lefons, A. Silvestri, and F. Tangorra, “An Analytic Approach to Statistical Databases,” Proc. Ninth Int’l Conf. Very Large Data Bases, pp. 260-274, Nov. 1983.
    [19] N.R. Adam and J.C. Worthmann, “Security-Control Methods for Statistical Databases: A Comparative Study,” ACM Computing Surveys (CSUR), vol. 21, no. 4, pp. 515-556, 1989.
    [20] M. Z. Islam, and L. Brankovic,” Noise Addition for Protecting Privacy in Data Mining,” in Proceedings of The 6th Engineering Mathematics and Applications Conference (EMAC2003), Sydney, 85-90.
    [21] M.Z.Islam, P.M.Barnaghi and L. Brankovic, “Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees,” in Proceedings ofthe 6e" Intenational Conference on Computer & Information Technology (ICCIT 2003), 2003, Dhaka, Bangladesh, Vol. 2, 457-462.
    [22] M.Z.Islam and L. Brankovic, “A Framework for Privacy Preserving Classification in Data Mining,” in Proceedings of Australasian Workshop on Data Mining and Web Intelligence (DMWI 2004), Dunedin, New Zealand, CRPIT, 32, J. Hogan, P. Montague, M. Purvis and C. Steketee, Eds., Australasian Compuiter Science Communications, 163-168.
    [23] Md.Z. Islam, L. Brankovic, “DETECTIVE: A decision tree based categorical value clustering and perturbation technique for preserving privacy in data mining,” in 2005 3rd IEEE International Conference on Industrial Informatics, INDIN 2005, art. no. 1560461, pp. 701-708.
    [24] X.-D. Wu, D.-M. Yue, F.-L. Liu, Y.-F. Wang, C.-H. Chu, “Privacy preserving data mining algorithms by data distortion,” in Proceedings of 2006 International Conference on Management Science and Engineering, ICMSE'06 (13th), art. no. 4104898, 2007, pp. 223-228.
    [25] T. Dalenius and S.P. Reiss, “Data-Swapping: A Technique for Disclosure Control,” J. Statistical Planning and Inference, vol. 6, pp. 73-85, 1982.
    2.2.4 Horizontally and Vertically Partition
    [26] W. Du and Z. Zhan, "Building Decision Tree Classifier on Private Data", In CRPITS’14: Proceedings of the IEEE international conference on Privacy, security and data mining, pages 1–8, Darlinghurst, Australia, Australia, 2002. Australian Computer Society, Inc.
    [27] V. Yan Fu Tan, S.-K. Ng, “Privacy-preserving sharing of horizontallydistributed private data for constructing accurate classifiers,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4890 LNCS, pp. 116-137.
    [28] M.-J. Xiao, K. Han, L.-S. Huang and J.-Y. Li, “Privacy Preserving C4.5 Algorithm Over Horizontally Partitioned Data” (2006) Proceedings - Fifth International Conference on Grid and Cooperative Computing, GCC 2006, art. no. 4031437, pp. 78-85
    [29] Samet, S., Miri, A., “Privacy preserving ID3 using gini index over horizontally partitioned data “ (2008) AICCSA 08 - 6th IEEE/ACS International Conference on Computer Systems and Applications, art. no. 4493598, pp. 645-651
    [30] M.-J. Xiao, L.-S. Huang, Y.-L. Luo, and H. Shen, “Privacy Preserving ID3 Algorithm over Horizontally Partitioned Data,” in Parallel and istributed Computing, Applications and Technologies, pages 239–243, 2005. 651.
    [31] J. Zhan, “Using homomorphic encryption for privacy-preserving collaborative decision tree classification,” in Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2007, art. no. 4221360, pp. 637-645.
    [32] J. F. Kenney and E. S. Keeping, "Frequency Distributions." §1.8 in Mathematics of Statistics, Pt. 1, 3rd ed. Princeton, NJ: Van Nostrand, pp. 12-19, 1962.
    [33] P. Legendre and L. Legendre, (1998) Numerical Ecology, 2nd English edition. Elsevier, Amsterdam.
    [34] A. Asuncion and D. Newman, “UCI machine learning repository [http://www.ics.uci.edu/∼mlearn/MLRepository.html],” 2008.

    無法下載圖示 Full text public date 2014/07/27 (Intranet public)
    Full text public date This full text is not authorized to be published. (Internet public)
    Full text public date This full text is not authorized to be published. (National library)
    QR CODE