
Author: Dan-Ling Hsieh (謝丹齡)
Thesis title: Data Imputation for Mixture Data Type with Ordinal Scale by K-prototype and Naïve Bayes Classifier (基於K-prototype與貝氏分類器填補混合型次序尺量缺失資料)
Advisor: Yung-Ho Leu (呂永和)
Committee members: Wei-Ning Yang (楊維寧), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Pages: 31
Keywords: Data Imputation, K-prototype Algorithm, Naïve Bayes Classifier


Data science is becoming increasingly popular because it has so many applications. For example, precision medicine, autonomous driving, and recommendation systems all rely on data science techniques. In addition, many applications depend on large models to make good predictions, and usually the larger the model, the more data is needed to build it. However, data examples may be incomplete for various reasons, which makes data imputation necessary to fill in the missing values of incomplete data examples.
In this thesis, we first use the K-prototype algorithm to divide the complete data examples in the dataset into several clusters. To fill in the missing values of an incomplete data example, we use a Naïve Bayes classifier to predict its cluster label from its non-missing values, and then fill in the missing values according to the corresponding cluster of complete data examples. The imputation methods were compared by running classification on the datasets that each method imputed. The proposed method outperformed the existing methods on the Wholesale Customers data set and achieved results comparable to those of the GFCMI method. The difference in classification accuracy between our imputed dataset and the original dataset is within 3 percent.
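
The three steps described in the abstract (cluster the complete examples, predict the cluster of an incomplete example from its observed values, fill in from that cluster) can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation. The abstract does not say how values are taken from the cluster, so the fill rule here (cluster mean for numeric attributes, cluster mode for categorical ones) is an assumption. So are the Gaussian likelihood for numeric attributes, the Laplace smoothing, the `gamma` weight, seeding the prototypes with the first `k` rows, and every function name. Ordinal attributes are not treated specially.

```python
import math
from collections import Counter

def mixed_dist(row, proto, num_idx, cat_idx, gamma):
    """K-prototypes dissimilarity: squared Euclidean distance on numeric
    attributes plus gamma times the number of categorical mismatches."""
    d = sum((row[j] - proto[j]) ** 2 for j in num_idx)
    return d + gamma * sum(row[j] != proto[j] for j in cat_idx)

def prototype(rows, num_idx, cat_idx):
    """Cluster prototype: mean of numeric attributes, mode of categorical ones."""
    p = list(rows[0])
    for j in num_idx:
        p[j] = sum(r[j] for r in rows) / len(rows)
    for j in cat_idx:
        p[j] = Counter(r[j] for r in rows).most_common(1)[0][0]
    return p

def k_prototypes(rows, k, num_idx, cat_idx, gamma=1.0, iters=20):
    """Cluster complete rows. Seeding with the first k rows is an assumption."""
    protos = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda c: mixed_dist(r, protos[c], num_idx, cat_idx, gamma))
               for r in rows]
        if new == labels:          # assignments stable: converged
            break
        labels = new
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                protos[c] = prototype(members, num_idx, cat_idx)
    return labels, protos

def nb_predict(row, rows, labels, k, num_idx, cat_idx):
    """Naive Bayes over cluster labels, using only the observed (non-None)
    attributes of `row`. Gaussian likelihood for numeric attributes and
    Laplace-smoothed frequencies for categorical ones (both assumptions)."""
    best, best_score = None, -math.inf
    for c in range(k):
        members = [r for r, l in zip(rows, labels) if l == c]
        if not members:
            continue
        score = math.log(len(members) / len(rows))      # log prior
        for j in num_idx:
            if row[j] is None:
                continue
            mean = sum(r[j] for r in members) / len(members)
            var = sum((r[j] - mean) ** 2 for r in members) / len(members) + 1e-6
            score += -0.5 * math.log(2 * math.pi * var) - (row[j] - mean) ** 2 / (2 * var)
        for j in cat_idx:
            if row[j] is None:
                continue
            n_values = len({r[j] for r in rows})
            count = sum(r[j] == row[j] for r in members)
            score += math.log((count + 1) / (len(members) + n_values))
        if score > best_score:
            best, best_score = c, score
    return best

def impute(row, rows, labels, protos, k, num_idx, cat_idx):
    """Fill missing values from the prototype of the predicted cluster."""
    c = nb_predict(row, rows, labels, k, num_idx, cat_idx)
    return [protos[c][j] if v is None else v for j, v in enumerate(row)]

# Toy data (made up for illustration): two numeric columns, one categorical.
complete = [
    [1.0, 2.0, "a"], [9.0, 8.0, "b"], [1.2, 1.8, "a"],
    [8.8, 8.2, "b"], [0.8, 2.2, "a"], [9.2, 7.8, "b"],
]
labels, protos = k_prototypes(complete, 2, [0, 1], [2])
filled = impute([1.1, None, None], complete, labels, protos, 2, [0, 1], [2])
print(filled)
```

In this toy run the incomplete row is assigned to the cluster of low-valued "a" rows, and its two missing attributes are taken from that cluster's prototype. A fuller implementation would also need a principled choice of `k` and `gamma`, and an encoding for ordinal attributes.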

ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
  1.1. Background
  1.2. Motivation
  1.3. Purpose
Chapter 2. Related Work
  2.1. Data Imputation
  2.2. Cluster Analysis
    2.2.1. K-means
    2.2.2. K-prototypes
  2.3. Bayes Theorem
    2.3.1. Conditional Independence
    2.3.2. Naïve Bayes Classifier
Chapter 3. Proposed Approach
  3.1. Overview
  3.2. Data Preprocessing
    3.2.1. Wholesale Customers Data Set
    3.2.2. Wine Data Set
  3.3. Cluster analysis
  3.4. Data imputation
  3.5. Evaluation Metric
Chapter 4. Experimental Result
  4.1. Experimental Environment
  4.2. Dataset Description
    4.2.1. Wholesale Customers data set
    4.2.2. Wine data set
  4.3. Experimental Result of Wholesale Customers Data Set
    4.3.1. Parameters
    4.3.2. Evaluation of Prediction Models
    4.3.3. Performance Comparisons
  4.4. Experimental Result of Wine Data Set
    4.4.1. Parameters
    4.4.2. Evaluation of Prediction Models
    4.4.3. Performance Comparisons
  4.5. Discussion
Chapter 5. Conclusions and Future Research
  5.1. Conclusions
  5.2. Future Research
REFERENCES

