
Author: 劉承剛
Cheng-Kang Liu
Thesis Title: 以序列比對為基礎並應用分類器技術擷取並整合生物資訊來源之蛋白質序列註解系統
Improving SIM-based annotation method of protein sequence using support vector machine
Advisor: 李漢銘
Hahn-Ming Lee
Committee: 何正信
Cheng-Seen Ho
何建明
Jan-Ming Ho
黃淇竣
Chi-Chun Huang
鮑興國
Hsing-Kuo Pao
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2005
Graduation Academic Year: 93
Language: English
Pages: 78
Keywords (in Chinese): 分類, 功能註解, 蛋白質序列, 生物資訊
Keywords (in other languages): Bioinformatics, Protein sequence, Function annotation, Classification
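  • Protein annotation is important information that helps biologists understand protein function. However, with the rapid development of biological sequencing technology in recent years, large numbers of protein sequences have been sequenced, and the traditional approach of producing annotations by hand can no longer keep up with the rapidly generated sequence data. Using computational techniques to generate protein annotations quickly and automatically has therefore become an important topic in bioinformatics. In addition, sequences of known function and their annotation information are often spread across different bioinformatics databases, so researchers frequently have to go through many websites and databases to gather the data they need. How to integrate these data, so that researchers can search them more easily and obtain valuable information from them, is another important topic in bioinformatics.
    Searching for homologous sequences by sequence alignment, and thereby annotating protein sequences of unknown function, is currently the most widely used method. However, many functional inconsistencies still exist among proteins with similar sequences, which lowers annotation accuracy. This thesis proposes a method that is based on sequence similarity search and uses a support vector machine (SVM) to filter out erroneous annotations automatically. The method integrates multiple data sources and extracts annotation information from them. To increase annotation accuracy, it places similar sequences that share the same annotation in the same cluster and considers these similar sequences together when generating an annotation. This reduces erroneous annotations produced by a single low-similarity sequence or by similar segments that fall in non-functional regions. Experimental results confirm that the proposed method effectively filters out erroneous annotations and maintains high accuracy across different annotation systems.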


    The gap between the number of protein sequences and the number of reliable function annotations in public databases is growing. Traditional manual annotation by literature curation cannot keep up with the rapid growth of new protein sequences, so automatic annotation methods for protein sequences are in great demand. Sequence similarity (SIM) methods, such as BLAST, are the most commonly used; they search for homology and evolutionary relationships between protein sequences. However, a considerable number of functional inconsistencies exist among similar protein sequences. A method that automatically eliminates erroneous annotations is therefore needed to improve SIM-based methods. In addition, biological data are distributed across different databases, each with its own data types, which makes it difficult for users to obtain the data they need. Integrating the various types of biological data into a single environment for function annotation of protein sequences is also an important issue.
    In this thesis, we present a protein sequence annotation method named MAPS (Multiple Annotation for Protein Sequences). It provides a mechanism to extract multiple annotations from various types of biological data and automatically eliminates erroneous annotations with a pre-trained SVM classifier. MAPS assigns an annotation to the input protein sequence by considering all hit proteins that carry this annotation together, rather than a single hit protein. This reduces erroneous annotations inferred from weak sequence similarity or from sequence identity in non-functional segments. The experimental results show that erroneous annotations are eliminated effectively and that accuracy remains high across different types of annotations.
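    The clustering idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `Hit` record, the function names, and the three features (`support`, `mean_identity`, `max_identity`) are assumptions chosen for the example, and the step in which a pre-trained SVM accepts or rejects each cluster is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit (hypothetical record, not the thesis's schema)."""
    protein_id: str
    identity: float          # fraction of identical residues, 0.0 to 1.0
    annotations: frozenset   # annotation labels attached to the hit protein


def cluster_by_annotation(hits):
    """Group hits so that each cluster holds every hit sharing one annotation."""
    clusters = defaultdict(list)
    for hit in hits:
        for label in hit.annotations:
            clusters[label].append(hit)
    return dict(clusters)


def cluster_features(cluster, total_hits):
    """Illustrative per-cluster features (assumed, not taken from the thesis)."""
    identities = [h.identity for h in cluster]
    return {
        "support": len(cluster) / total_hits,   # share of all hits in this cluster
        "mean_identity": mean(identities),
        "max_identity": max(identities),
    }


# Made-up hits for one query sequence.
hits = [
    Hit("P1", 0.90, frozenset({"kinase"})),
    Hit("P2", 0.80, frozenset({"kinase", "ATP-binding"})),
    Hit("P3", 0.30, frozenset({"transporter"})),
    Hit("P4", 0.70, frozenset({"kinase"})),
]
clusters = cluster_by_annotation(hits)
features = {label: cluster_features(c, len(hits)) for label, c in clusters.items()}
```

    In this toy input, the "kinase" cluster is backed by three of the four hits, while "transporter" rests on a single weak hit. A classifier trained on such per-cluster feature vectors could then decide which annotations to keep, which is the role the abstract assigns to the SVM.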

    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Motivation
        1.2 The challenges of protein function annotation
            1.2.1 Ambiguous cutting threshold
            1.2.2 Error annotations
            1.2.3 Distribution of data sources
        1.3 Goals
        1.4 Outline of this thesis
    Chapter 2 Background
        2.1 Protein function annotation
            2.1.1 Current automatic annotation methods
        2.2 Tools in MAPS
            2.2.1 SwissProt
            2.2.2 InterPro: integrated protein documentation resources
            2.2.3 GO: gene ontology
            2.2.4 BLAST
            2.2.5 SVM: support vector machine
    Chapter 3 MAPS
        3.1 Overview of MAPS
        3.2 System architecture of MAPS
            3.2.1 Similar protein sequence searching unit
            3.2.2 Annotation-based protein clustering unit
            3.2.3 Protein cluster selecting and evaluating unit
            3.2.4 GO annotation searching unit
        3.3 Cluster features and domain matching score
            3.3.1 The features of protein cluster
            3.3.2 Supporting score
            3.3.3 Similarity score
            3.3.4 Domain matching score
        3.4 Characteristics of MAPS
    Chapter 4 Experiments
        4.1 Experimental data
        4.2 Experimental setup
        4.3 Experimental results
            4.3.1 Signature protein cluster selection
            4.3.2 Keyword protein cluster selection
            4.3.3 Domain matching score
            4.3.4 Comparison between MAPS and BLAST
            4.3.5 Comparison between MAPS and InterProScan
            4.3.6 Agreement of GO annotations between MAPS and SGD
    Chapter 5 Conclusion and further work
        5.1 Discussion
        5.2 Conclusion
        5.3 Further work
    References


    Full text: not authorized for public release (intranet, Internet, or National Library).