
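Graduate student: 張嬡真 (Ai-chen Chang)
Thesis title: Gene Selection of Microarray Data Using Information Gain and Genetic Algorithms (結合資訊增益與基因演算法的微陣列基因篩選方法)
Advisor: 呂永和 (Yung-ho Leu)
Oral defense committee: 楊維寧 (Wei-ning Yang), 陳雲岫 (Yun-Shiow Chen)
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 50
Chinese keywords: 基因挑選 (gene selection), 基因微陣列 (gene microarray), 資訊增益 (information gain), 基因演算法 (genetic algorithm)
English keywords: Gene selection, Microarray data, Information gain, Genetic algorithms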
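    Because each biochip experiment is costly, a biochip experiment usually has only tens to hundreds of samples. Microarray data are therefore characterized by a small sample size and a high feature dimensionality, which makes them ill-suited to classification methods. The main direction of current microarray analysis research is thus how to select, from thousands upon thousands of genes, the genes that truly affect a disease. Building on a combination of statistical analysis and data mining methods, this study proposes a new hybrid three-stage gene selection method, with the aim of reaching high classification accuracy using only a few genes that carry important discriminative information.
    The gene selection method in this study is divided into three stages. The first stage uses information gain to measure how well each gene classifies the samples. The second stage uses correlation coefficients as the basis for clustering: genes whose correlation coefficient exceeds a threshold are placed in the same cluster, and the gene with the largest information gain in each cluster is selected. Analysis of variance (ANOVA) is then applied to the selected genes to test whether gene expression levels differ among samples of different disease classes. Genes that show no significant difference are removed and the rest are kept, giving a set of candidate genes with better discriminative power for the disease. The third stage uses a genetic algorithm for further gene selection and then uses a support vector machine to compute the accuracy on the training data. The experimental results show that the proposed gene selection method achieves high classification accuracy while selecting only a small number of important genes.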
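    The first stage can be illustrated with a minimal sketch of information gain on a discretized gene. This is not the thesis code: the function names `entropy` and `information_gain` and the toy data are assumptions made for illustration, and the expression values are assumed to be already discretized into categories.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Label entropy minus the entropy remaining after splitting on a gene.

    `values` holds one discretized expression level per sample.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (made up): two discretized genes measured on six samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
gene_a = ["high", "high", "high", "low", "low", "low"]  # separates the classes
gene_b = ["high", "low", "high", "low", "high", "low"]  # almost uninformative

ig_a = information_gain(gene_a, labels)
ig_b = information_gain(gene_b, labels)
```

    In this toy case `gene_a` receives a much larger information gain than `gene_b`, so a filter that ranks genes by this score would keep `gene_a` first.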


    Because microarray experiments are expensive, a microarray dataset usually contains only dozens or hundreds of samples but thousands of genes. This combination of small sample size and high feature dimensionality hinders the use of microarray data in diagnosis and other clinical applications. Many researchers have therefore focused on the gene selection problem: selecting the smallest number of genes that still gives high classification accuracy. In this research, we propose a new hybrid three-stage gene selection method that combines statistical analysis and data mining techniques.
    The proposed gene selection method comprises three stages. In the first stage, we use the information gain of each gene in the microarray dataset to filter out genes with little discriminative power for classification. In the second stage, genes are clustered according to their correlation coefficients: highly correlated genes are grouped into one cluster, and each cluster contributes its gene with the highest information gain as a candidate gene. The candidate genes are then tested with analysis of variance (ANOVA); the genes that pass the test are those with better discriminative capability in classification. In the third stage, we use a genetic algorithm to select a subset of the remaining genes as the final set of selected genes. The experimental results show that the proposed method outperforms existing methods in both classification accuracy and the number of selected genes.
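    The grouping step of the second stage can be sketched as follows. The abstract does not say which correlation coefficient or which clustering procedure is used, so this sketch assumes the Pearson coefficient and a simple greedy grouping; the names `pearson` and `pick_representatives`, the threshold, the scores, and the toy expression values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def pick_representatives(expression, scores, threshold):
    """Greedy grouping (an assumed scheme, not necessarily the thesis's).

    Genes are visited from highest to lowest score. A gene joins the first
    cluster whose representative it correlates with above `threshold`;
    otherwise it starts a new cluster. Each dictionary key is therefore the
    highest-scoring gene of its cluster.
    """
    clusters = {}
    for gene in sorted(expression, key=lambda g: scores[g], reverse=True):
        for rep in clusters:
            if pearson(expression[gene], expression[rep]) > threshold:
                clusters[rep].append(gene)
                break
        else:
            clusters[gene] = [gene]
    return clusters

# Toy data (made up): expression of three genes over four samples,
# with information-gain scores assumed to come from the first stage.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.1, 3.9, 6.2, 8.0],   # nearly proportional to g1
    "g3": [4.0, 1.0, 3.5, 1.5],   # weakly related to g1
}
scores = {"g1": 0.9, "g2": 0.7, "g3": 0.4}
clusters = pick_representatives(expression, scores, threshold=0.8)
```

    Here `g2` is absorbed into the cluster represented by `g1`, while `g3` forms its own cluster, so `g1` and `g3` would go on to the ANOVA test as candidate genes.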

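    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Objectives and Motivation
      1.3 Thesis Outline
    Chapter 2 Literature Review
      2.1 Biochips
      2.2 Introduction to Gene Microarrays
        2.2.1 Gene Microarray Workflow
        2.2.2 Gene Microarray Data Types
        2.2.3 Applications of Gene Microarray Data
      2.3 Hybrid Gene Selection Methods
        2.3.1 Information Gain
        2.3.2 Support Vector Machines
        2.3.3 Correlation Coefficients
          2.3.3.1 Pearson Correlation Coefficient
          2.3.3.2 Ecological Correlation Coefficient
          2.3.3.3 K-Category Correlation Coefficient
        2.3.4 t-Test
        2.3.5 Analysis of Variance
        2.3.6 BW Ratio
        2.3.7 Genetic Algorithms
    Chapter 3 Research Method
      3.1 Method Framework
      3.2 Data Preprocessing
        3.2.1 Normalization of Microarray Data
        3.2.2 Discretization of Microarray Data
      3.3 Gene Selection Method
        3.3.1 Stage 1: Information Gain
        3.3.2 Stage 2: Correlation Coefficient Analysis and Clustering
        3.3.3 Stage 3: Genetic Algorithm
      3.4 Accuracy Calculation
    Chapter 4 Experimental Results and Analysis
      4.1 Dataset Sources and Description
      4.2 Comparison and Analysis of Results
        4.2.1 Stage 1 Gene Selection Results: Information Gain
        4.2.2 Stage 2 Gene Selection Results: Correlation Coefficient Analysis and Clustering
        4.2.3 Stage 3 Gene Selection Results: Genetic Algorithm and Accuracy Results
        4.2.4 Comparison with Other Methods
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Work
    References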

