Graduate student: | Yuan-chang Wu (吳元彰) |
---|---|
Thesis title: | Applying nearest neighbors approach and grey analysis to missing values completion in database (應用近似鄰居方法與灰關聯分析於資料庫中遺失值填補問題) |
Advisor: | Chen-Chau Yang (楊鍵樵) |
Oral examination committee: | Hsing-Lung Chen (陳省隆), Sheng-Luen Chung (鍾聖倫), Jenn-Nan Chen (陳振楠), Yu-Chi Chu (朱雨其) |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering |
Year of publication: | 2007 |
Academic year of graduation: | 95 (ROC calendar) |
Language: | Chinese |
Number of pages: | 83 |
Chinese keywords: | 遺失值, 近似鄰居插補法, 相似, 灰關聯度, 基因表現資料 |
English keywords: | missing values, nearest neighbors imputation, similarity, grey relation, gene expression data |
This study addresses three problems that arise when nearest neighbors imputation is applied to datasets of different character: how to define similarity, how to keep execution efficient, and how to obtain a set of complete records. First, for datasets whose attributes differ widely in how strongly they are related to one another, the grey relational method for predicting missing values is modified: a weighted grey relational grade is proposed as the similarity measure between records, with each attribute's weighting factor defined by its relation to the attribute that contains the missing value. Second, to reduce the time nearest neighbors imputation spends searching for similar records in a large dataset, a candidate set replaces the complete set as the pool searched by the weighted grey relational method, which cuts the number of record comparisons. Third, to handle gene expression datasets, where high attribute missing rates leave too few complete records, a two-stage missing value completion approach is proposed. In the first stage, records with missing values are filled once to produce a complete dataset with no missing values; in the second stage, every missing value is filled again using that complete dataset. Experimental results show that the three proposed approaches effectively resolve these three problems.
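The weighted similarity and the two-stage completion described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation; several details are assumptions made for illustration. The grey relational coefficient follows the commonly used form with distinguishing coefficient `zeta = 0.5`. Attribute weights are taken as normalized absolute Pearson correlations with the target attribute. The first-stage fill uses column means. Attributes are assumed to be numeric and already scaled to comparable ranges, and every column is assumed to have at least one observed value. All function names are hypothetical, and the candidate-set step is not shown.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def attribute_weights(complete, target):
    """Assumed weighting: |correlation| with the target column, normalized to sum to 1."""
    cols = [j for j in range(len(complete[0])) if j != target]
    tcol = [r[target] for r in complete]
    raw = {j: abs(pearson([r[j] for r in complete], tcol)) for j in cols}
    total = sum(raw.values())
    if total == 0:
        return {j: 1 / len(cols) for j in cols}  # fall back to equal weights
    return {j: v / total for j, v in raw.items()}

def weighted_grey_grades(record, complete, weights, zeta=0.5):
    """Weighted grey relational grade between `record` and each complete record."""
    deltas = [{j: abs(record[j] - r[j]) for j in weights} for r in complete]
    flat = [d for row in deltas for d in row.values()]
    dmin, dmax = min(flat), max(flat)
    if dmax == 0:
        return [1.0] * len(complete)  # every record is identical to `record`
    return [
        sum(w * (dmin + zeta * dmax) / (row[j] + zeta * dmax)
            for j, w in weights.items())
        for row in deltas
    ]

def impute(record, target, complete, k=2):
    """Estimate record[target] as the mean over the k most similar complete records."""
    weights = attribute_weights(complete, target)
    grades = weighted_grey_grades(record, complete, weights)
    top = sorted(range(len(complete)), key=lambda i: grades[i], reverse=True)[:k]
    return sum(complete[i][target] for i in top) / len(top)

def two_stage_impute(data, k=2):
    """Stage 1: fill gaps with column means (assumed first-pass rule).
    Stage 2: re-estimate each originally missing cell from the filled dataset."""
    means = []
    for j in range(len(data[0])):
        known = [r[j] for r in data if r[j] is not None]
        means.append(sum(known) / len(known))
    filled = [[means[j] if v is None else v for j, v in enumerate(r)] for r in data]
    result = [list(r) for r in filled]
    for i, row in enumerate(data):
        others = filled[:i] + filled[i + 1:]  # exclude the record being imputed
        for j, v in enumerate(row):
            if v is None:
                result[i][j] = impute(filled[i], j, others, k)
    return result

# Toy data: the last record is missing its third attribute.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0],
        [10.0, 1.0, 30.0], [2.1, 4.1, None]]
print(two_stage_impute(rows, k=2)[4])
```

In this toy dataset the first attribute correlates more strongly with the target than the second, so it receives the larger weight and dominates the ranking of neighbors.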