
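Graduate student: 蔡暉毅
Hui-yi Tsai
Thesis title: 以機率抽樣為基礎應用在微陣列資料的基因挑選方法
A Sampling-based Gene Selection Method for Microarray Data
Advisor: 呂永和
Yung-ho Leu
Committee members: 楊維寧
Wei-Ning Yang
陳雲岫
Degree: Master
Department: College of Management - Department of Information Management
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar, 2009-2010)
Language: Chinese
Number of pages: 52
Keywords (Chinese): 基因挑選 (gene selection), 基因微陣列 (gene microarray), 機率抽樣 (probability sampling), 卡方同質性檢定 (chi-square test for homogeneity)
Keywords (English): Gene selection, Microarray data, Probability sampling, χ²-test for homogeneity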

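Abstract (Chinese, translated): In recent years, microarray technology has become an important research tool for biologists. A single microarray experiment yields a large volume of gene expression data, and by analyzing it we can identify genes that matter for disease diagnosis. Microarray data, however, differ sharply from conventional statistical data: a typical dataset covers several thousand to more than ten thousand genes but, because samples are hard to collect, contains only a few dozen samples. This combination of high feature dimensionality and small sample size easily leads to large errors in classification and prediction. Researchers have therefore concentrated in recent years on gene selection methods and on classification methods for small-sample, high-dimensional data, but most of these methods are computationally complex and hard to use, or need a long time to train a classification model. This thesis therefore proposes a simple gene selection procedure that combines commonly used statistical methods to achieve high classification accuracy.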

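This thesis proposes a three-stage gene selection procedure for microarray data based on probability sampling. In the first stage, a t-test, applied according to the characteristics of each gene dataset, divides the genes into three groups, and any unimportant group is removed as an initial reduction of the feature dimension. The second stage takes into account that many genes are important for classification only when they are expressed together: it uses probability sampling to generate a large number of gene subsets and keeps, by means of a threshold, those subsets that are important for classification. The third stage uses the chi-square test for homogeneity to pick the final genes that are important for classification, which form the selected gene set. Experimental results on three public gene datasets, Leukemia, Colon, and Lymphoma, show that the proposed procedure is simple and reliable and achieves high classification accuracy with only a few important genes.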
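The first two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the three groups are defined or how sampling probabilities are assigned, so the sketch assumes Welch's t statistic, a hypothetical cutoff of 2.0, and sampling weights proportional to |t|. All function names and the toy data are made up. The threshold-based filtering of subsets is omitted because it depends on a classifier the abstract does not specify.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def score_genes(expr, labels):
    """Map each gene to its t statistic between class 0 and class 1 samples."""
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = welch_t(class0, class1)
    return scores


def group_genes(scores, cutoff=2.0):
    """Split genes into three groups; the cutoff is an illustrative assumption."""
    groups = {"higher_in_class0": [], "higher_in_class1": [], "uninformative": []}
    for gene, t in scores.items():
        if t >= cutoff:
            groups["higher_in_class0"].append(gene)
        elif t <= -cutoff:
            groups["higher_in_class1"].append(gene)
        else:
            groups["uninformative"].append(gene)
    return groups


def sample_subset(weights, size, rng):
    """Draw `size` distinct genes, each pick made proportionally to its weight."""
    pool = dict(weights)
    chosen = []
    for _ in range(size):
        genes = list(pool)
        pick = rng.choices(genes, weights=[pool[g] for g in genes], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # remove the gene so a subset has no duplicates
    return chosen


# Toy data (made up): three class 0 samples followed by three class 1 samples.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],
    "g2": [1.1, 0.8, 1.0, 4.7, 5.2, 5.0],
    "g3": [2.0, 3.1, 2.4, 2.2, 3.0, 2.5],
    "g4": [3.0, 3.4, 3.2, 1.9, 2.1, 2.0],
}
scores = score_genes(expr, labels)
groups = group_genes(scores)

# Stage 1: drop the uninformative group.
kept = groups["higher_in_class0"] + groups["higher_in_class1"]

# Stage 2: sample candidate subsets, weighting genes by |t| (an assumption).
weights = {g: abs(scores[g]) for g in kept}
rng = random.Random(0)
subsets = [sample_subset(weights, 2, rng) for _ in range(5)]
```

On real microarray data the number of sampled subsets and the subset size would be tuning parameters; the values 5 and 2 here only keep the toy example small.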


Abstract (English): Microarray technology has become an important tool for biologists in recent years. It can measure the expression levels of many genes simultaneously in a single experiment. One research issue in microarray analysis is selecting a set of relevant genes from a large number of genes to assist clinical diagnosis. Microarray data are characterized by high dimensionality and a relatively small number of samples: a dataset usually contains thousands of genes (sometimes more than ten thousand) and fewer than 100 samples. This characteristic tends to lower classification accuracy in clinical diagnosis. To counter this problem, many researchers have focused on "gene selection" and "dimension reduction of microarray data" in recent years. However, the existing methods are usually complex and time-consuming.

In this thesis, we propose a novel method for gene selection on microarray data. In the proposed method, we first classify genes into different groups according to their expression levels in the microarray data. Then, we use probability sampling to generate a large number of candidate gene subsets. Finally, we use the χ²-test for homogeneity to select the relevant genes from the candidate gene subsets. The experimental results show that the proposed method outperforms the existing methods in terms of both classification accuracy and the number of genes selected.
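For readers unfamiliar with the final step, the following is a minimal sketch of a χ²-test for homogeneity on a table of counts. How the thesis builds its contingency tables is not stated in the abstract, so the table below is a hypothetical example (expression of one gene discretized into high and low, counted per sample class), and the function name is invented for illustration.

```python
def chi_square_homogeneity(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if every row shared the same column distribution.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are the two sample classes, columns are
# "gene expressed high" and "gene expressed low".
table = [[18, 2],
         [4, 16]]
stat, dof = chi_square_homogeneity(table)

# Critical value of the chi-square distribution with 1 degree of freedom
# at the 5% significance level.
CRITICAL_1DOF_5PCT = 3.841
differs = stat > CRITICAL_1DOF_5PCT
```

A statistic above the critical value suggests that the gene's expression pattern is not homogeneous across the classes, which is the kind of evidence a selection step could use to keep the gene.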

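Chapter 1 Introduction
  1.1. Research Background
  1.2. Research Objectives and Motivation
  1.3. Thesis Outline
Chapter 2 Literature Review
  2.1. Gene Microarrays
    2.1.1. Microarray Workflow
    2.1.2. Microarray Data Types
    2.1.3. Applications of Microarray Data
  2.2. Hybrid Gene Selection Methods
    2.2.1. Information Gain
    2.2.2. Hierarchical Clustering
    2.2.3. Support Vector Machines
  2.3. Two-Stage Classification Selection Methods
    2.3.1. t-Statistic
    2.3.2. BW Ratio
    2.3.3. K-Nearest Neighbors
  2.4. Ensemble Method
    2.4.1. Artificial Neural Networks
    2.4.2. Particle Swarm Optimization
    2.4.3. Estimation of Distribution Algorithms
  2.5. Fuzzy Logic Selection Methods
    2.5.1. Fuzzy Sets
    2.5.2. Fuzzy Logic
Chapter 3 Research Method
  3.1. Experimental Framework
  3.2. Stage 1: Dimensionality Reduction
    3.2.1. Characteristics of Microarray Data
    3.2.2. Two-Population Hypothesis Testing
    3.2.3. Removing Groups Without Classification Importance
  3.3. Stage 2: Generating Gene Subsets by Probability Sampling
    3.3.1. Computing the Feature Value of Each Group
    3.3.2. Generating a Large Number of Gene Subsets
    3.3.3. Filtering Gene Subsets
  3.4. Stage 3: Selecting Important Genes
    3.4.1. Chi-Square Test for Homogeneity
  3.5. Evaluation Procedure
Chapter 4 Experimental Results and Analysis
  4.1. Data Sources and Description
  4.2. Parameter Settings
    4.2.1. Stage 1
    4.2.2. Stage 2
    4.2.3. Stage 3
  4.3. Comparison and Analysis of Results
    4.3.1. Leukemia
    4.3.2. Colon Cancer
    4.3.3. Lymphoma
    4.3.4. Comparison with Other Methods
  4.4. Summary
Chapter 5 Conclusions and Future Work
  5.1. Conclusions
  5.2. Future Work
References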

