
Author: 邱田保 (TIAN-BAO CIOU)
Thesis Title: Codon Optimization Method Based on Deep Learning and Weight Matrix
Advisors: 張以全 (I-Tsyuen Chang), 周文奇 (Wen-Chi Chou)
Oral Defense Committee: 張以全 (I-Tsyuen Chang), 周文奇 (Wen-Chi Chou), 張春梵 (Chun-Fan Chang), 劉孟昆 (Meng-Kun Liu)
Degree: Master
Department: College of Engineering - Department of Mechanical Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: Chinese
Number of Pages: 112
Keywords: Deep Learning, Codon Weight Matrix, Codon Optimization
Views: 144; Downloads: 0
    This thesis combines deep learning with genetic codon information to optimize
    human gene sequences. We train Bi-LSTM and Transformer models on a human gene
    pool and, during the prediction phase, introduce a codon frequency matrix,
    comparing how incorporating this matrix affects each model. Our findings show
    that after optimization the nucleotide composition of the sequences changes,
    with the cytosine (C) and guanine (G) content rising. However, we also observe
    that when predicting genes with multiple protein isoforms, the Transformer
    model is better at predicting nucleotides that match the original gene
    sequence, rather than simply increasing the C and G content. Finally, sequence
    alignments of two genes show that the optimized regions are not the same for
    every gene: under the influence of the codon weight matrix, each gene is
    optimized in a slightly different direction.
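
    The codon-frequency weighting and GC-content comparison described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis's implementation: the helper names (`codon_counts`, `codon_frequency_matrix`, `gc_content`) and the toy sequences are assumptions introduced here for clarity.

    ```python
    from collections import Counter

    def codon_counts(cds):
        """Count codon occurrences in a coding sequence (length divisible by 3)."""
        assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
        return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))

    def codon_frequency_matrix(sequences):
        """Relative usage frequency of each codon across a set of coding
        sequences. In the thesis such weights are folded into the models'
        prediction scores; here we only compute the raw frequencies."""
        total = Counter()
        for seq in sequences:
            total += codon_counts(seq)
        n = sum(total.values())
        return {codon: count / n for codon, count in total.items()}

    def gc_content(seq):
        """Fraction of guanine (G) and cytosine (C) nucleotides in a sequence."""
        return (seq.count("G") + seq.count("C")) / len(seq)

    # Hypothetical sequences standing in for an original and an optimized CDS:
    # both encode Met-Ala-Lys, but the second uses GC-richer synonymous codons,
    # mirroring the rise in C/G content reported in the abstract.
    original = "ATGGCTAAA"
    optimized = "ATGGCCAAG"

    weights = codon_frequency_matrix([original, optimized])
    print(f"GC before: {gc_content(original):.2f}, after: {gc_content(optimized):.2f}")
    ```

    Because the two sequences are synonymous at the protein level, the change in GC content (from 0.33 to 0.56 here) comes entirely from codon choice, which is the effect the codon weight matrix steers.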

    Contents

    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    List of Abbreviations
    1 Introduction
      1.1 Research Background
        1.1.1 Deoxyribonucleic acid and nucleotide
        1.1.2 Recurrent Neural Network
        1.1.3 Long Short-Term Memory Cell
        1.1.4 Sequence-to-Sequence Model Architecture
        1.1.5 Transformer
    2 Literature Review and Research Motivation
      2.1 Literature Review
        2.1.1 Effects of codon optimization
        2.1.2 Codon optimization methods
      2.2 Research Motivation
      2.3 Contribution and structure of this thesis
    3 Deep Learning Model and Codon Frequency Matrix Import
      3.1 Word Embedding Layer
        3.1.1 Word Vector
        3.1.2 Word embedding matrix
      3.2 Parameter Optimization
        3.2.1 Stochastic Gradient Descent
        3.2.2 Momentum
        3.2.3 AdaGrad
        3.2.4 RMSProp
        3.2.5 Adam
      3.3 Loss Function
      3.4 Recurrent Neural Network
        3.4.1 Long Short-Term Memory Network
      3.5 Transformer
        3.5.1 Self-Attention mechanism
        3.5.2 Positional encoding
        3.5.3 Mask
        3.5.4 Learning Rate Schedule
      3.6 Codon Frequency Weight Matrix
        3.6.1 Codon Frequency Weight Matrix Calculation
        3.6.2 Import the Codon Frequency Weight Matrix into the Bi-LSTM Model
        3.6.3 Import the Codon Frequency Weight Matrix into the Transformer Model
      3.7 Dataset Sources and Data Cleaning Steps
    4 Experimental Results and Analysis
      4.1 Gene Data Preprocessing and Analysis
      4.2 Model Training Parameters and Training Results
        4.2.1 Hamster Dataset Training Results
        4.2.2 Human Protein Atlas Dataset Training Results
      4.3 Analysis of Model Prediction Results
        4.3.1 Predictive Analysis of Human Protein Atlas Genes in Hamster Model
        4.3.2 Predictive Analysis of Hamster Genes in Human Model
        4.3.3 Predictive Analysis of Human Protein Atlas Genes in Human Model
    5 Conclusion and Future Works
      5.1 Conclusion
      5.2 Future Works
    References
    Letter of Authority


    Full-text release date: 2026/08/11 (campus network)
    Full-text release date: 2026/08/11 (off-campus network)
    Full-text release date: 2026/08/11 (National Central Library: Taiwan thesis and dissertation system)