
Graduate student: 吳元彰
Yuan-chang Wu
Thesis title: 應用近似鄰居方法與灰關聯分析於資料庫中遺失值填補問題
Applying nearest neighbors approach and grey analysis to missing values completion in database
Advisor: 楊鍵樵
Chen-Chau Yang
Oral defense committee: 陳省隆
Hsing-Lung Chen
鍾聖倫
Sheng-Luen Chung
陳振楠
Jenn-Nan Chen
朱雨其
Yu-Chi Chu
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 83
Keywords (Chinese): 遺失值, 近似鄰居插補法, 相似, 灰關聯度, 基因表現資料
Keywords (English): missing values, nearest neighbors imputation, similarity, grey relation, gene expression data
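  • This study addresses three problems that arise when nearest neighbors imputation is applied to datasets with different characteristics: defining similarity, execution performance, and selecting the complete dataset. For datasets in which the degree of association between attributes varies widely, the thesis improves the grey relational method for predicting missing values by proposing a weighted grey relational grade as the similarity measure. The weight factor of each attribute is defined according to its association with the attribute that contains the missing value. To reduce the time that nearest neighbors imputation spends searching for similar records in large datasets, the study uses a candidate dataset in place of the complete dataset as the set searched by the weighted grey relational method, which reduces the number of record comparisons. To address the problem that gene expression datasets contain too few complete records because of high attribute missing rates, the study proposes a two-stage approach to missing value completion: records containing missing values are first filled once to obtain a complete dataset without missing values, and this complete dataset is then used to fill each missing value again. Experimental results show that the three proposed missing value completion methods effectively resolve these three problems.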
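    The weighted grey relational grade mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard grey relational coefficient from Deng's grey system theory with a distinguishing coefficient of 0.5, attributes already scaled to a common range, and attribute weights supplied by the caller. The record does not give the thesis's formula for deriving weights from attribute association, so the function name, parameters, and example weights below are hypothetical.

```python
def weighted_grey_relational_grades(reference, candidates, weights, zeta=0.5):
    """Return one weighted grey relational grade per candidate record.

    reference  : observed attribute values of the record with a missing value
    candidates : records described by the same attributes, in the same order
    weights    : one non-negative weight per attribute (normalized here to sum to 1)
    zeta       : distinguishing coefficient in (0, 1]; 0.5 is an assumed default
    """
    total = sum(weights)
    w = [x / total for x in weights]

    # Absolute difference between the reference and each candidate, per attribute.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every candidate equals the reference on every attribute.
        return [1.0 for _ in candidates]

    grades = []
    for row in deltas:
        # Grey relational coefficient for each attribute of this candidate.
        coeffs = [(d_min + zeta * d_max) / (d + zeta * d_max) for d in row]
        # Weighted sum instead of the plain average used by the unweighted grade.
        grades.append(sum(wk * ck for wk, ck in zip(w, coeffs)))
    return grades


# Hypothetical usage: the first attribute is assumed to matter three times as much.
grades = weighted_grey_relational_grades(
    [0.2, 0.5], [[0.2, 0.5], [0.9, 0.1]], weights=[3, 1]
)
```

    A candidate identical to the reference receives a grade of 1.0, and the records with the highest grades would serve as the nearest neighbors for imputation.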


    This thesis addresses three problems (similarity measurement, performance, and collection of complete data) that arise when nearest neighbors imputation is applied to different datasets. First, for datasets whose attributes differ widely in how strongly they are related, the grey relational analysis method for missing value prediction is modified by adding attribute weighting factors, and the resulting weighted grey relational analysis is used to calculate the similarity between records. Second, when similarity between records is calculated in a large dataset, a candidate set is used in place of the complete set to reduce the number of comparisons. Third, a two-stage missing value completion approach is proposed to resolve the problem that few complete records can be collected from a gene expression dataset. In the first stage, a complete source dataset is produced by filling in the missing values. In the second stage, every missing value is completed again based on that complete source dataset. Experimental results show that the three proposed approaches resolve these problems effectively.
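    The two-stage idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the thesis's procedure: stage one uses column means as a placeholder fill, stage two uses Euclidean distance and the average of the k nearest records, and every column is assumed to have at least one observed value. The function name and the parameter `k` are hypothetical.

```python
import math


def two_stage_impute(rows, k=2):
    """rows: equal-length lists of floats, with None marking a missing value."""
    n_cols = len(rows[0])

    # Stage 1 (assumed placeholder): fill each missing cell with the mean of the
    # observed values in its column, giving a dataset with no missing values.
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

    # Stage 2: re-estimate every originally missing cell from the k records of
    # the stage-1 dataset that are closest on all the other attributes.
    result = [list(r) for r in filled]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            others = [c for c in range(n_cols) if c != j]
            dists = []
            for m, cand in enumerate(filled):
                if m == i:
                    continue
                d = math.sqrt(sum((filled[i][c] - cand[c]) ** 2 for c in others))
                dists.append((d, m))
            dists.sort()
            neighbours = [m for _, m in dists[:k]]
            result[i][j] = sum(filled[m][j] for m in neighbours) / len(neighbours)
    return result


# Hypothetical usage with one missing value in the second record.
completed = two_stage_impute([[1.0, 2.0], [1.1, None], [5.0, 10.0], [0.9, 1.8]])
```

    Because stage one makes every record complete, stage two can draw neighbors from the whole dataset even when few records were complete to begin with, which is the situation the abstract describes for gene expression data.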

    Chinese Abstract
    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Method
      1.3 Chapter Overview
    Chapter 2 Related Work
      2.1 Definition and Classification of Missing Values
      2.2 Methods for Handling Missing Values
      2.3 Nearest Neighbors Imputation
      2.4 Similarity Measurement Methods
        2.4.1 Distance Measures
        2.4.2 Grey Relational Analysis
    Chapter 3 Filling Missing Values Using Grey Relational Analysis
      3.1 Review of Missing Value Completion Methods Based on Grey Relational Analysis
      3.2 Missing Value Completion with Weighted Grey Relational Analysis
      3.3 Worked Example
    Chapter 4 An Integrated Method for Missing Value Completion
      4.1 Review of the Automatic Clustering Algorithm
      4.2 Integrated Method: Complete-Data Preprocessing Stage
        4.2.1 Establishing Attribute Weight Relations
        4.2.2 Partitioning Attribute Domains
      4.3 Integrated Method: Candidate Set Construction Stage
      4.4 Integrated Method: Missing Value Completion Stage
    Chapter 5 A New Method for Missing Value Completion in Gene Expression Data
      5.1 Microarray Technology
      5.2 The Missing Value Problem in Gene Expression Data
      5.3 Local Least Squares Imputation
      5.4 The New Method
        5.4.1 Stage One
        5.4.2 Stage Two
      5.5 Worked Example
        5.5.1 Stage One Calculation
        5.5.2 Stage Two Calculation
        5.5.3 Error Rate Comparison
    Chapter 6 Experimental Results and Discussion
      6.1 Experimental Environment
      6.2 Experimental Data
        6.2.1 Iris Dataset
        6.2.2 Liver-disorders Dataset
        6.2.3 Air Quality Monitoring Data
        6.2.4 Gene Expression Data
      6.3 Experimental Results and Discussion
        6.3.1 Experiment 1
        6.3.2 Experiment 2
        6.3.3 Experiment 3
    Chapter 7 Conclusion
    References


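    Full-text release date: 2012/07/26 (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)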