Student: Fathurrochman Habibie
Thesis Title: GA-based Feature Selection for Protein Secondary Structure Prediction
Advisor: 呂永和 Yungho Leu
Committee Members: 楊維寧 Wei-Ning Yang, 陳雲岫 Yun-Shiow Chen
Degree: Master
Department: College of Management - Department of Information Management
Publication Year: 2021
Graduation Academic Year: 109
Language: English
Pages: 87
Keywords: Feature Selection, Genetic Algorithm, Protein Secondary Structure Prediction, CNN, BLSTM
Proteins are essential macromolecules for the structure and function of a cell. Protein interactions control various vital functions in the body, such as activating the immune system, regulating oxygenation, and determining drug response. The secondary structure of a protein can usually be determined through experimental methods (e.g., X-ray crystallography, NMR). However, these methods are expensive, time-consuming, and require complex procedures. Hence, computational approaches for predicting the secondary structure of a protein are important in biology.
The secondary structure of a protein is determined by its constituent sequence of amino acids. Protein secondary structure prediction usually relies on two fixed features: amino acid sequences and PSSM profiles. However, additional protein features (e.g., biophysical properties, physicochemical properties, conformation scores) can improve prediction accuracy. This thesis focuses on feature selection to improve the accuracy of predicting the secondary structure of a protein. We first use a CNN model together with a genetic algorithm to find an optimal subset of features. We then train a CNN-BLSTM model on the selected features, achieving 74.5% Q8 accuracy on the CB513 dataset.
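The GA-based feature selection described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the fitness function here is a placeholder bit-matching score standing in for the validation accuracy of a CNN trained on each candidate feature subset, and all names and parameters (`N_FEATURES`, `POP_SIZE`, the target mask, etc.) are illustrative assumptions.

```python
import random

random.seed(42)

N_FEATURES = 10      # illustrative feature count; the thesis selects among protein feature groups
POP_SIZE = 20
GENERATIONS = 30
MUTATION_RATE = 0.05

# Placeholder fitness: in the thesis this would be the validation accuracy of a
# CNN trained on the feature subset encoded by the chromosome (1 = feature kept).
# Here we simply reward agreement with a hypothetical "useful" feature mask.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

def fitness(chromosome):
    # Fraction of positions matching the target mask (stand-in for model accuracy)
    return sum(c == t for c, t in zip(chromosome, TARGET)) / N_FEATURES

def tournament(pop, k=3):
    # Tournament selection: return the fittest of k randomly sampled individuals
    return max(random.sample(pop, k), key=fitness)

def crossover(parent_a, parent_b):
    # Single-point crossover
    point = random.randint(1, N_FEATURES - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome):
    # Flip each bit independently with probability MUTATION_RATE
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

def run_ga():
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(POP_SIZE)]
    best = max(pop, key=fitness)
    for _ in range(GENERATIONS):
        pop = [mutate(crossover(tournament(pop), tournament(pop)))
               for _ in range(POP_SIZE)]
        best = max(pop + [best], key=fitness)  # keep the best seen so far
    return best

if __name__ == "__main__":
    best = run_ga()
    print("selected feature mask:", best, "fitness:", fitness(best))
```

In the actual pipeline, evaluating each chromosome means training and validating a CNN on the corresponding feature subset, so the GA loop dominates the cost; the final CNN-BLSTM model is then trained once on the winning subset.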