
Graduate Student: Fathurrochman Habibie
Thesis Title: GA-based Feature Selection for Protein Secondary Structure Prediction
Advisor: Yungho Leu (呂永和)
Committee Members: Wei-Ning Yang (楊維寧), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: Department of Information Management, School of Management
Publication Year: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 87
Keywords: Feature Selection, Genetic Algorithm, Protein Secondary Structure Prediction, CNN, BLSTM
    Abstract: Proteins are essential macromolecules for the structure and function of a cell. Protein interactions control various vital functions in the body, such as activating the immune system, regulating oxygenation, and determining drug response. The secondary structure of a protein can usually be determined through experimental methods (e.g., X-ray crystallography, NMR). However, these methods are expensive, time-consuming, and procedurally complex. Hence, computational approaches for predicting the secondary structure of a protein are important in biology.

    The secondary structure of a protein is determined by its constituent sequence of amino acids. Protein secondary structure prediction usually relies on two fixed features: amino acid sequences and PSSM profiles. However, additional protein features (e.g., biophysical and physicochemical properties, conformation scores) can improve prediction accuracy. This thesis focuses on feature selection to improve the accuracy of predicting the secondary structure of a protein. We first propose using a CNN model together with a genetic algorithm to find an optimal subset of features. We then train a CNN-BLSTM model on the selected features, achieving 74.5% Q8 accuracy on the CB513 dataset.
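    The GA-based feature selection described above can be sketched as follows. This is a minimal, self-contained illustration, not the thesis's implementation: the chromosome is a binary mask over candidate feature groups, parents are chosen by tournament selection, and the fitness function here is a stand-in (overlap with a hidden "useful" mask) for the validation accuracy of a CNN trained on the selected features, which would be far too costly to evaluate inline. All names and parameter values are assumptions.

```python
import random

random.seed(0)

N_FEATURES = 12                       # number of candidate feature groups (illustrative)
POP_SIZE, GENERATIONS = 20, 30
TOURNAMENT_K, CX_RATE, MUT_RATE = 3, 0.8, 0.05

# Stand-in fitness target: in the thesis this role is played by the
# validation accuracy of a CNN trained on the selected feature subset.
USEFUL = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

def fitness(mask):
    # Fraction of positions where the mask agrees with the hidden target.
    return sum(m == u for m, u in zip(mask, USEFUL)) / N_FEATURES

def tournament(pop):
    # Pick the fittest of K randomly sampled individuals.
    return max(random.sample(pop, TOURNAMENT_K), key=fitness)

def crossover(a, b):
    # One-point crossover, applied with probability CX_RATE.
    if random.random() > CX_RATE:
        return a[:], b[:]
    cut = random.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(mask):
    # Independent bit-flip mutation per gene.
    return [1 - g if random.random() < MUT_RATE else g for g in mask]

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    nxt = []
    while len(nxt) < POP_SIZE:
        c1, c2 = crossover(tournament(pop), tournament(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt[:POP_SIZE]

best = max(pop, key=fitness)
print(best, fitness(best))
```

    In the thesis's setting, each 1-bit would switch on one feature group (e.g., PSSM, physicochemical features, conformation scores) before training the CNN used as the fitness evaluator.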

    Table of Contents:
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Background
      1.2 Problem Formulation
      1.3 Objectives
      1.4 Scope and Limitation
      1.5 Contribution
      1.6 Research Outline
    Chapter 2 Related Work
    Chapter 3 Theoretical Basis
      3.1 Protein
      3.2 Protein Structure
      3.3 Protein Structure Prediction
      3.4 Protein Features
        3.4.1 Amino Acid Sequence Features
        3.4.2 Position-Specific Scoring Matrix (PSSM)
        3.4.3 Physical Features
        3.4.4 Conformation Parameters
      3.5 Deep Learning Network
        3.5.1 Convolutional Neural Network (CNN)
        3.5.2 Long Short-Term Memory (LSTM)
        3.5.3 Bi-directional LSTM (BLSTM)
        3.5.4 Attention Model
      3.6 Genetic Algorithm
    Chapter 4 Proposed Method
      4.1 System Overview
      4.2 Dataset Analysis
      4.3 System Design
      4.4 GA-based Feature Selection
        4.4.1 Chromosome Representation
        4.4.2 Fitness Function
        4.4.3 Tournament Selection
        4.4.4 Crossover and Mutation
      4.5 Architecture Design
        4.5.1 CNN Architecture
        4.5.2 CNN-BLSTM Architecture
        4.5.3 CNN-BLSTM with Attention Layer
        4.5.4 CNN-BLSTM with Highway Connections
      4.6 Training Design
      4.7 Evaluation Design
    Chapter 5 Implementation
      5.1 Implementation Environment
      5.2 Data Preprocessing
      5.3 Architecture Implementation
      5.4 CNN Architecture
        5.4.1 CNN-BLSTM Architecture
        5.4.2 CNN-BLSTM with Attentions
        5.4.3 CNN-BLSTM with Highway
        5.4.4 Genetic Algorithm Implementation
      5.5 Evaluation Implementation
    Chapter 6 Experimental Result
      6.1 Dataset
      6.2 Finding Optimal Feature Subset
        6.2.1 Experimental Parameters
        6.2.2 Single Objective GA
        6.2.3 Multiobjective GA
        6.2.4 Optimal Feature Subsets Result
      6.3 Evaluation on Optimal Feature Subset
      6.4 Comparison with Previous Studies
    Chapter 7 Conclusions
      7.1 Conclusions
      7.2 Future Works
    References
    Appendix
      A1. Predicted Result on CASP10 T0711
      A2. Predicted Result on CASP10 T0651-D3
      A3. Predicted Result on CASP10 T0716
      A4. Predicted Result on CASP10 T0726-D2
      A5. Predicted Result on CASP10 T0685-D2
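    Q8 accuracy, the metric reported in the abstract, is simply the fraction of residues whose predicted 8-state secondary-structure label (H, G, I, E, B, T, S, C in DSSP notation) matches the true label. A minimal sketch — the toy sequences below are illustrative, not drawn from CB513:

```python
def q8_accuracy(true_labels, pred_labels):
    # Per-residue 8-class accuracy: matching positions / sequence length.
    assert len(true_labels) == len(pred_labels)
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)

true_ss = "HHHHCCEEEETTCC"   # toy ground-truth 8-state string
pred_ss = "HHHHCCEEEECCCC"   # toy prediction; the two T residues are missed

print(q8_accuracy(true_ss, pred_ss))  # 12 of 14 residues correct
```

    On a real benchmark such as CB513, the same count is typically accumulated over all residues of all proteins in the test set before dividing.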

