Author: 劉承剛
Cheng-Kang Liu
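Thesis Title: 以序列比對為基礎並應用分類器技術擷取並整合生物資訊來源之蛋白質序列註解系統 (A Protein Sequence Annotation System Based on Sequence Alignment That Applies Classifier Techniques to Extract and Integrate Bioinformatics Sources)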
Improving SIM-based annotation method of protein sequence using support vector machine
Advisor: 李漢銘
Hahn-Ming Lee
Committee: 何正信
Cheng-Seen Ho
Jan-Ming Ho
Chi-Chun Huang
Hsing-Kuo Pao
Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2005
Graduation Academic Year: 93
Language: English
Pages: 78
Keywords: Bioinformatics, Protein sequence, Function annotation, Classification
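  • Protein annotation is important information that helps biologists understand protein function. However, with the rapid development of biological sequencing technology in recent years, large numbers of protein sequences have been sequenced, and the traditional approach of producing annotations manually can no longer keep up with the rapidly generated sequence data. Using computational techniques to generate protein annotations quickly and automatically has therefore become an important topic in bioinformatics. In addition, sequences of known function and their annotation information are often distributed across different bioinformatics databases, so researchers frequently have to visit many different websites and databases to collect the data they need. How to integrate these data, so that researchers can search more easily and obtain valuable information from them, is thus another important topic in bioinformatics.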

    The gap between the number of protein sequences and the reliable functional annotations available in public databases is growing. Traditional manual annotation by literature curation cannot keep up with the rapid growth of new protein sequences, so automatic annotation methods for protein sequences are in great demand. Sequence similarity (SIM) methods, such as BLAST, are the most commonly used; they search for homology and evolutionary relationships between protein sequences. However, similar protein sequences show a considerable number of functional inconsistencies. A method that automatically eliminates erroneous annotations is therefore needed to improve SIM-based methods. In addition, biological data are distributed across different databases, each with its own data types, which makes it difficult for users to obtain the data they need. Integrating the various types of biological data into a single environment for functional annotation of protein sequences is also an important issue.
    In this paper, we present a protein sequence annotation method named MAPS (Multiple Annotation for Protein Sequences), which provides a mechanism to extract multiple annotations from various types of biological data and automatically eliminates erroneous annotations using a pre-trained SVM classifier. It assigns an annotation to the input protein sequence by considering all hit proteins that carry this annotation together, rather than a single hit protein. This can reduce erroneous annotations inferred from weak sequence similarity or from sequence identity in non-functional segments. The experimental results show that erroneous annotations can be eliminated effectively while maintaining high accuracy across different types of annotations.
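    The idea of judging an annotation by all hit proteins that carry it, rather than by a single hit, can be illustrated with a minimal Python sketch. This is not the MAPS implementation: the `Hit` record, the `cluster_features` function, and the two features `support` and `mean_identity` are hypothetical names and definitions chosen for illustration only, and the abstract does not say how the thesis defines its cluster features. In a pipeline of the kind described, per-annotation feature vectors like these would be what a pre-trained classifier accepts or rejects.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Hit:
    """One similar protein returned by a sequence similarity search (hypothetical record)."""
    protein_id: str
    identity: float         # percent identity to the query, 0-100
    annotations: frozenset  # annotation terms attached to the hit protein


def cluster_features(hits):
    """Group hits by annotation term and summarise each group.

    The two features are illustrative assumptions, not the thesis's definitions:
      support       - fraction of all hits that carry the term
      mean_identity - average percent identity of the hits carrying the term
    """
    clusters = defaultdict(list)
    for hit in hits:
        for term in hit.annotations:
            clusters[term].append(hit)

    total = len(hits)
    return {
        term: {
            "support": len(members) / total,
            "mean_identity": mean(h.identity for h in members),
        }
        for term, members in clusters.items()
    }


if __name__ == "__main__":
    # Made-up hits for illustration; not data from the thesis.
    example_hits = [
        Hit("P1", 90.0, frozenset({"kinase", "ATP-binding"})),
        Hit("P2", 80.0, frozenset({"kinase"})),
        Hit("P3", 35.0, frozenset({"transport"})),
    ]
    for term, feats in sorted(cluster_features(example_hits).items()):
        print(term, feats)
```

    In this toy input, "kinase" is carried by two of the three hits, both with high identity, while "transport" rests on a single weak hit. A cluster-level view makes that difference explicit, which is the kind of evidence a classifier could use to discard annotations inferred from weak similarity.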

    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Motivation
      1.2 The Challenges of protein function annotation
        1.2.1 Ambiguous cutting threshold
        1.2.2 Error annotations
        1.2.3 Distribution of data sources
      1.3 Goals
      1.4 Outline of this thesis
    Chapter 2 Background
      2.1 Protein function annotation
        2.1.1 Current automatic annotation methods
      2.2 Tools in MAPS
        2.2.1 SwissProt
        2.2.2 InterPro: integrated protein documentation resources
        2.2.3 GO: gene ontology
        2.2.4 BLAST
        2.2.5 SVM: support vector machine
    Chapter 3 MAPS
      3.1 Overview of MAPS
      3.2 System architecture of MAPS
        3.2.1 Similar protein sequence searching unit
        3.2.2 Annotation-based protein clustering unit
        3.2.3 Protein cluster selecting and evaluating unit
        3.2.4 GO annotation searching unit
      3.3 Cluster features and domain matching score
        3.3.1 The features of protein cluster
        3.3.2 Supporting score
        3.3.3 Similarity score
        3.3.4 Domain matching score
      3.4 Characteristics of MAPS
    Chapter 4 Experiments
      4.1 Experimental data
      4.2 Experimental setup
      4.3 Experimental results
        4.3.1 Signature protein cluster selection
        4.3.2 Keyword protein cluster selection
        4.3.3 Domain matching score
        4.3.4 Comparison between MAPS and BLAST
        4.3.5 Comparison between MAPS and InterProScan
        4.3.6 Agreement of GO annotations between MAPS and SGD
    Chapter 5 Conclusion and further work
      5.1 Discussion
      5.2 Conclusion
      5.3 Further work
    References

