
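Graduate student: Jung-Ying Wang (王榮英)
Thesis title: Prediction and Evolutionary Information Analysis of Proteins Solvent Accessibility (蛋白質溶劑可接觸面積之預測與分析)
Advisor: Hahn-Ming Lee (李漢銘)
Oral examination committee:
  Ching-Chi Hsu (許清琦)
  Jorng-Tzong Horng (洪炯宗)
  Feng-Tse Lin (林豐澤)
  錢文南 (no English name given)
  Yuh-Jye Lee (李育杰)
  Hsing-Kuo Kenneth Pao (鮑興國)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2007
Graduation academic year: 95 (ROC calendar)
Language: English
Number of pages: 145
Keywords (translated from Chinese): solvent accessibility (溶劑可接觸面積), pattern dictionary (樣式字典), multiple linear regression analysis (多線性迴歸分析), protein structure prediction (蛋白質結構預測), support vector machine (支持向量機)
Keywords (English): solvent accessibility, look-up table, multiple linear regression, protein structure prediction, support vector machine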

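Abstract (translated from Chinese)

In recent years, machine learning methods such as neural networks and support vector machines have been widely used for prediction in many bioinformatics problems. Although they achieve relatively high accuracy, their biggest problem is that they are not transparent: one cannot examine why a particular result was obtained or carry out further analysis, for example of how cooperation and competition between a target residue and its neighboring residues affect the prediction. In this dissertation we therefore propose two methods, the construction of pattern dictionaries (look-up tables) and multiple linear regression analysis. A series of statistical analyses based on these two methods helps us understand more deeply how interactions between protein residues affect solvent accessibility.
Solvent accessibility prediction has so far split into two main branches: traditional two-class (or multi-class) prediction and real-value prediction. In this dissertation we propose a new prediction system, which we name SVM-Cabins. The system first uses position-specific scoring matrices, derived from protein evolutionary information, as features. It then partitions the data using accumulation cutoff sets, where each cutoff set applies one threshold to make a traditional two-class split. A support vector machine performs two-class solvent accessibility prediction for each cutoff set, and the two-class predictions from all cutoff sets are finally mapped to a real value of solvent accessibility. With 13 accumulation cutoff sets, the system reaches a mean absolute error of 15.1% and a correlation coefficient of 0.66 on the Barton502 dataset, which the author reports as the best accuracy to date. Because the system first performs traditional two-class prediction and then derives real-value predictions from it, it optimizes both kinds of prediction (two-class and real-value) at the same time. The approach can also be applied to any problem whose predicted values lie within a numerical range.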
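The two figures quoted above, mean absolute error and correlation coefficient, can be illustrated with a short sketch. It assumes the standard definitions of both quantities (the record does not state the exact formulas used in the thesis), and the function names and sample values are hypothetical.

```python
import math


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def pearson_correlation(predicted, observed):
    """Pearson correlation coefficient between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_p * var_o)


# Hypothetical relative solvent accessibility values in percent (not thesis data).
predicted = [10.0, 20.0, 30.0]
observed = [12.0, 18.0, 33.0]
mae = mean_absolute_error(predicted, observed)
corr = pearson_correlation(predicted, observed)
```

When both inputs are percentages of relative solvent accessibility, the mean absolute error is itself expressed in percentage points, which is how a figure such as 15.1% would be read under this assumption.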


Abstract (English)

A common problem with machine learning prediction methods (e.g. neural networks and support vector machines) is that they are not transparent. We cannot look inside a neural network or a support vector machine to determine why it produces a particular prediction. As a result, although these methods are accurate predictors, they give no insight into the cooperation or competition between the solvent accessibility of a residue and that of its neighbors. In this dissertation we therefore develop look-up table and multiple linear regression methods for real-value prediction of solvent accessibility, and use them to provide insight through nearest-neighbor effect analysis and evolutionary information analysis.
In addition, a number of methods have been developed for predicting the solvent accessibility, or accessible surface area (ASA), of amino acid residues in proteins. These methods predict either regularly spaced discrete states of relative solvent accessibility or an analogue real value indicating relative solvent accessibility. In this dissertation we develop a novel method named SVM-Cabins. It first predicts discrete ASA states of amino acids from their evolutionary profile and then maps the predicted states onto a real-valued linear space by simple algebraic methods. Predicting a larger number of ASA states and then finding a corresponding scheme for real-value prediction may help integrate the two approaches to ASA prediction. Using 13-state ASA prediction, this approach performs better than any ASA prediction method reported so far. Because the method starts with the prediction of discrete ASA states and leads to real-value predictions, prediction performance in binary states and in real values is optimized simultaneously. The SVM-Cabins method can also be used to predict numerical values for test data in other problems, provided the target values of the training data are distributed within a numerical range.
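The idea of predicting several binary states and mapping them onto a real value can be sketched as follows. This is only an illustration under stated assumptions: the 13 cutoff values are hypothetical (the record does not list them), the binary outputs stand in for the per-cutoff SVM predictions, and the midpoint mapping is one plausible "simple algebraic" choice, not the algorithm described in the thesis.

```python
# Hypothetical cutoffs in percent relative ASA; the record does not give the real ones.
CUTOFFS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 95]


def binary_labels(rsa, cutoffs=CUTOFFS):
    """One training label per cutoff: 1 if the residue is at least as exposed as the cutoff."""
    return [1 if rsa >= cutoff else 0 for cutoff in cutoffs]


def map_to_real_value(binary_outputs, cutoffs=CUTOFFS, lower=0.0, upper=100.0):
    """Map per-cutoff binary outputs to a real value (assumed midpoint rule).

    The number of positive outputs selects an interval between neighboring
    cutoffs, and the midpoint of that interval is returned.
    """
    if len(binary_outputs) != len(cutoffs):
        raise ValueError("expected one binary output per cutoff")
    edges = [lower] + list(cutoffs) + [upper]
    positives = sum(binary_outputs)
    return (edges[positives] + edges[positives + 1]) / 2.0


# A residue with 22% relative ASA exceeds the first four hypothetical cutoffs.
labels = binary_labels(22)
estimate = map_to_real_value(labels)
```

Counting the positive outputs, rather than looking for the first negative one, keeps the mapping defined even when the binary predictors disagree with each other (for example 1, 0, 1), which independent classifiers can produce.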

Content

Acknowledgements
Content
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Background
    1.1.1 Amino Acid
    1.1.2 Overview of Amino Acid Properties
    1.1.3 Hydrophilic and Hydrophobic
    1.1.4 Peptides
    1.1.5 Solvent Accessibility
    1.1.6 BLAST and PSSM
  1.2 Motivation
  1.3 Research Goals
  1.4 Organization of This Dissertation
Chapter 2 Prediction of Solvent Accessibility
  2.1 History
  2.2 Solvent Accessibility Prediction Methods
  2.3 First Generation: Using Single Sequence Data for Solvent Accessibility Prediction
    2.3.1 Bayesian Probabilistic Method
    2.3.2 Information Theory
    2.3.3 The Logistic Function
  2.4 Second Generation: Using Evolutionary Information Data for Solvent Accessibility Prediction
    2.4.1 Neural Networks
    2.4.2 Fuzzy k-nearest Neighbor Method
    2.4.3 Support Vector Regression (SVR)
    2.4.5 Hybrid Prediction Approach
  2.5 Quadratic Programming Method
Chapter 3 Prediction and Analysis by Using Look-up Table and Multiple Linear Regression
  3.1 Using Look-up Tables
    3.1.1 Development of Look-up Tables
    3.1.2 Prediction from Look-up Tables
  3.2 Using Multiple Linear Regression
    3.2.1 Coding Scheme
    3.2.2 Method of Multiple Linear Regression
    3.2.3 Evolutionary Information
  3.3 Data Selection
  3.4 Assessment of Prediction Performance
  3.5 Real Value Prediction from PHD
Chapter 4 Prediction and Analysis by Using Accumulation Cutoff Set and Support Vector Machine
  4.1 The Goal of Using SVM-Cabins to Predict the Protein Solvent Accessibility
  4.2 Basic Concept of Support Vector Machine
  4.3 For Multi-class Support Vector Machine
    4.3.1 One-against-all Method
    4.3.2 One-against-one Method
  4.4 Data Sets Used in SVM-Cabins
  4.5 Coding Scheme
  4.6 Unbalanced Data and Crisp Cutoff Set Problem in Using Multi-class Classification Methods to Assign Numerical Values of Solvent Accessibility
  4.7 Accumulation Cutoff Set
  4.8 Software and Model Selection
  4.9 Accuracy for the 13 Binary SVM Models
  4.10 Main Binary Output Patterns from the 13 Binary-class SVM Models
  4.11 Algorithms to Transfer a Number of Binary SVM Prediction Results to Numerical Values of Solvent Accessibility
  4.12 Assessment of Prediction Performance
Chapter 5 Results and Discussion
  5.1 Prediction Results from Look-up Tables
    5.1.1 Analysis Results from Look-up Tables
    5.1.2 Predictions Using Look-up Tables
  5.2 Prediction Results from Multiple Linear Regression
    5.2.1 Analysis of Sequence and Evolutionary Information
    5.2.2 Results for Multiple Linear Regression
    5.2.3 Variation of Prediction Error with ASA Value Range
    5.2.4 Residue-specific Variation in Prediction Error
    5.2.5 Effect of Protein Chain Length on Mean Absolute Error
    5.2.6 Effect of Alignment Coverage and Number of Iterations
    5.2.7 Effect of Sequence Neighbor Information on Prediction Accuracy
  5.3 Prediction Results for SVM-Cabins
    5.3.1 Prediction Performance on Rost126 and Barton502 Datasets
    5.3.2 Variation in Prediction Error with ASA Value Range
    5.3.3 Residue-specific Variation in Prediction Error
    5.3.4 Effect of Protein Chain Length on Mean Absolute Error
    5.3.5 Comparison with Results of Other Methods
    5.3.6 Real Value Predictions from Other Binary Predictors on the Web
    5.3.7 Implications to Protein Structure
Chapter 6 Conclusions
  6.1 Conclusion of Using Look-up Tables
  6.2 Conclusion of Using Multiple Linear Regression
  6.3 Conclusion of Using SVM-Cabins
  6.4 Further Work
References
Appendix A Benchmark Datasets
Appendix B List of Publications

