Graduate student: | 謝丹齡 Dan-Ling Hsieh |
---|---|
Thesis title: | 基於K-prototype與貝氏分類器填補混合型次序尺量缺失資料 Data Imputation for Mixture Data Type with Ordinal Scale by K-prototype and Naïve Bayes Classifier |
Advisor: | 呂永和 Yung-Ho Leu |
Oral defense committee: | 楊維寧 Wei-Ning Yang, 陳雲岫 Yun-Shiow Chen |
Degree: | Master |
Department: | School of Management - Department of Information Management |
Year of publication: | 2021 |
Academic year of graduation: | 109 (ROC calendar, 2020-2021) |
Language: | Chinese |
Number of pages: | 31 |
Keywords (Chinese): | 資料填補, K-prototype Algorithm, 貝氏分類器 |
Keywords (English): | Data Imputation, K-prototype Algorithm, Naïve Bayes Classifier |
Data science has become increasingly popular because of its many applications: precision medicine, autonomous driving, and recommendation systems all rely on data science techniques. Many of these applications use large models to make good predictions, and in general the larger the model, the more data is needed to build it. In practice, however, some data examples are incomplete, so data imputation is needed to fill in their missing values.
In this thesis, we first use the K-prototype algorithm to divide the complete data examples in the dataset into several clusters. To fill in the missing values of an incomplete data example, we use a Naïve Bayes classifier to predict its cluster label from its non-missing values, and then fill in the missing values according to the corresponding cluster of complete data examples. We compared imputation methods by running classification on the dataset that each method imputed. The proposed method outperformed the existing methods on the Wholesale Customers data set and was comparable to the GFCMI method. The difference in classification accuracy between our imputed dataset and the original dataset is within 3 percent.
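The three steps described in the abstract (cluster the complete examples, predict a cluster for each incomplete example, fill from that cluster) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names, the weight `GAMMA`, seeding the prototypes with the first `k` rows, the Gaussian likelihood with Laplace smoothing, the mean/mode fill rule, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

GAMMA = 1.0  # assumed weight of categorical mismatches in the mixed distance

def distance(row, proto, num_idx, cat_idx):
    """Mixed dissimilarity: squared Euclidean part + GAMMA * categorical mismatches."""
    num = sum((row[i] - proto[i]) ** 2 for i in num_idx)
    cat = sum(row[i] != proto[i] for i in cat_idx)
    return num + GAMMA * cat

def prototype(rows, num_idx, cat_idx):
    """Cluster prototype: mean of numeric attributes, mode of categorical ones."""
    proto = list(rows[0])
    for i in num_idx:
        proto[i] = sum(r[i] for r in rows) / len(rows)
    for i in cat_idx:
        proto[i] = Counter(r[i] for r in rows).most_common(1)[0][0]
    return proto

def k_prototypes(rows, k, num_idx, cat_idx, iters=20):
    """Cluster complete rows; the first k rows seed the prototypes (an assumption)."""
    protos = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: distance(r, protos[c], num_idx, cat_idx))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                protos[c] = prototype(members, num_idx, cat_idx)
    return labels, protos

def predict_cluster(row, rows, labels, k, num_idx, cat_idx):
    """Naive Bayes over the non-missing attributes of `row` (None marks a missing value)."""
    best, best_score = None, -math.inf
    for c in range(k):
        members = [r for r, l in zip(rows, labels) if l == c]
        if not members:
            continue
        score = math.log(len(members) / len(rows))  # log prior of the cluster
        for i in num_idx:
            if row[i] is None:
                continue
            mean = sum(r[i] for r in members) / len(members)
            var = sum((r[i] - mean) ** 2 for r in members) / len(members) + 1e-6
            score += -0.5 * math.log(2 * math.pi * var) - (row[i] - mean) ** 2 / (2 * var)
        for i in cat_idx:
            if row[i] is None:
                continue
            n_values = len({r[i] for r in rows})
            hits = sum(r[i] == row[i] for r in members)
            score += math.log((hits + 1) / (len(members) + n_values))  # Laplace smoothing
        if score > best_score:
            best, best_score = c, score
    return best

def impute(row, rows, labels, k, num_idx, cat_idx):
    """Fill missing values from the predicted cluster (mean/mode rule is an assumption)."""
    c = predict_cluster(row, rows, labels, k, num_idx, cat_idx)
    proto = prototype([r for r, l in zip(rows, labels) if l == c], num_idx, cat_idx)
    return [proto[i] if v is None else v for i, v in enumerate(row)]

# Made-up toy data: column 0 is numeric, column 1 is categorical.
NUM_IDX, CAT_IDX = [0], [1]
complete = [[1.0, "retail"], [1.2, "retail"], [0.8, "retail"],
            [9.0, "hotel"], [9.5, "hotel"], [8.7, "hotel"]]
labels, protos = k_prototypes(complete, 2, NUM_IDX, CAT_IDX)
print(impute([None, "hotel"], complete, labels, 2, NUM_IDX, CAT_IDX))
print(impute([1.1, None], complete, labels, 2, NUM_IDX, CAT_IDX))
```

Because the classifier only scores the attributes that are present, the same trained clusters can serve rows with different missing columns. A real experiment would also need to scale the numeric attributes and choose `k` and `GAMMA` with care, which this sketch leaves out.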