
Author: 邱田保 (TIAN-BAO CIOU)
Thesis Title: Codon Optimization Method Based on Deep Learning and Weight Matrix
Advisors: 張以全 (I-Tsyuen Chang), 周文奇 (Wen-Chi Chou)
Oral Defense Committee: 張以全 (I-Tsyuen Chang), 周文奇 (Wen-Chi Chou), 張春梵 (Chun-Fan Chang), 劉孟昆 (Meng-Kun Liu)
Degree: Master
Department: College of Engineering - Department of Mechanical Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: Chinese
Number of Pages: 112
Keywords: Deep Learning, Codon Weight Matrix, Codon Optimization
Views: 144; Downloads: 0
    This thesis combines deep learning with genetic codon information to optimize
    human gene sequences. We train Bi-LSTM and Transformer models on a human gene
    pool and, during the prediction phase, introduce a codon frequency matrix,
    comparing how incorporating this matrix affects each model. Our findings show
    that after optimization the nucleotide composition of the sequences changes,
    with the cytosine (C) and guanine (G) content rising. However, we also observe
    that when predicting genes with multiple protein isoforms, the Transformer
    model is better at predicting nucleotides that match the original gene
    sequence, rather than simply increasing the C and G content. Finally, sequence
    alignments of two genes show that the optimized regions are not the same for
    every gene: under the influence of the codon weight matrix, each gene is
    optimized in a slightly different direction.
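
    The codon-frequency weighting and GC-content comparison described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis's implementation: the helper names (`codon_counts`, `codon_frequency_matrix`, `gc_content`) and the toy sequences are assumptions introduced here for clarity.

    ```python
    from collections import Counter

    def codon_counts(cds):
        """Count codon occurrences in a coding sequence (length divisible by 3)."""
        assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
        return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))

    def codon_frequency_matrix(sequences):
        """Relative usage frequency of each codon across a set of coding
        sequences. In the thesis such weights are folded into the models'
        prediction scores; here we only compute the raw frequencies."""
        total = Counter()
        for seq in sequences:
            total += codon_counts(seq)
        n = sum(total.values())
        return {codon: count / n for codon, count in total.items()}

    def gc_content(seq):
        """Fraction of guanine (G) and cytosine (C) nucleotides in a sequence."""
        return (seq.count("G") + seq.count("C")) / len(seq)

    # Hypothetical sequences standing in for an original and an optimized CDS:
    # both encode Met-Ala-Lys, but the second uses GC-richer synonymous codons,
    # mirroring the rise in C/G content reported in the abstract.
    original = "ATGGCTAAA"
    optimized = "ATGGCCAAG"

    weights = codon_frequency_matrix([original, optimized])
    print(f"GC before: {gc_content(original):.2f}, after: {gc_content(optimized):.2f}")
    ```

    Because the two sequences are synonymous at the protein level, the change in GC content (from 0.33 to 0.56 here) comes entirely from codon choice, which is the effect the codon weight matrix steers.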

    Contents

    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    List of Abbreviations
    1 Introduction
      1.1 Research Background
        1.1.1 Deoxyribonucleic acid and nucleotide
        1.1.2 Recurrent Neural Network
        1.1.3 Long Short-Term Memory Cell
        1.1.4 Sequence-to-Sequence Model Architecture
        1.1.5 Transformer
    2 Literature Review and Research Motivation
      2.1 Literature Review
        2.1.1 Effects of codon optimization
        2.1.2 Codon optimization methods
      2.2 Research Motivation
      2.3 Contribution and structure of this thesis
    3 Deep Learning Model and Codon Frequency Matrix Import
      3.1 Word Embedding Layer
        3.1.1 Word Vector
        3.1.2 Word embedding matrix
      3.2 Parameter Optimization
        3.2.1 Stochastic Gradient Descent
        3.2.2 Momentum
        3.2.3 AdaGrad
        3.2.4 RMSProp
        3.2.5 Adam
      3.3 Loss Function
      3.4 Recurrent Neural Network
        3.4.1 Long Short-Term Memory Network
      3.5 Transformer
        3.5.1 Self-Attention mechanism
        3.5.2 Positional encoding
        3.5.3 Mask
        3.5.4 Learning Rate Schedule
      3.6 Codon Frequency Weight Matrix
        3.6.1 Codon Frequency Weight Matrix Calculation
        3.6.2 Import the Codon Frequency Weight Matrix into the Bi-LSTM Model
        3.6.3 Import the Codon Frequency Weight Matrix into the Transformer Model
      3.7 Dataset Sources and Data Cleaning Steps
    4 Experimental Results and Analysis
      4.1 Gene Data Preprocessing and Analysis
      4.2 Model Training Parameters and Training Results
        4.2.1 Hamster Dataset Training Results
        4.2.2 Human Protein Atlas Dataset Training Results
      4.3 Analysis of Model Prediction Results
        4.3.1 Predictive Analysis of Human Protein Atlas Genes in Hamster Model
        4.3.2 Predictive Analysis of Hamster Genes in Human Model
        4.3.3 Predictive Analysis of Human Protein Atlas Genes in Human Model
    5 Conclusion and Future Works
      5.1 Conclusion
      5.2 Future Works
    References
    Letter of Authority


    Full-text release date: 2026/08/11 (campus network)
    Full-text release date: 2026/08/11 (off-campus network)
    Full-text release date: 2026/08/11 (National Central Library: Taiwan thesis and dissertation system)