
Graduate student: Pai-Ling Lo (羅百玲)
Thesis title: An Unsupervised Pattern Recognition Method for Identifying TFBS Based on DNA Short Sequence Features
Advisors: Jan-Ming Ho (何建明), Hahn-Ming Lee (李漢銘)
Committee members: Hsing-Kuo Pao (鮑興國), I-En Liao (廖宜恩), Ming-Jing Huang (黃明經)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 80
Keywords (translated from Chinese): transcription factor, segment sequence features, unsupervised pattern recognition
Keywords (English): binding site, unsupervised pattern, transcription factor
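    Most transcription factor binding sites lie in the upstream sequences of genes. To understand how genes are regulated, identifying these binding sites is the first task. However, a recently published assessment of methods for identifying transcription factor binding sites shows that this task remains very challenging, especially in higher and more complex organisms such as humans. Because the regulatory mechanism is not well understood, there is no definitive representation for describing the binding sites of a given transcription factor. In addition, only a small portion of the upstream sequence binds transcription factors, so the task is also hampered by a huge amount of background noise. It therefore remains an important and difficult biological problem.
    In this thesis, we propose an unsupervised pattern recognition method for biological data that are incomplete (not all binding sites have been discovered) and unbalanced (dominated by background noise). To model the binding activity of each segment in a gene's upstream sequence, we propose a feature vector built from the short-sequence features within the segment and their frequencies. Once the feature vectors have been clustered onto different nodes, we identify the nodes likely to contain binding sites by considering each node's over-representation and its distribution across the sequences. To evaluate the method, we adopt the benchmark data provided by a recent journal paper that assessed computational methods for identifying transcription factor binding sites. The experimental results show that, compared with a related method (SOMBRERO), the proposed method places patterns that are likely binding sites at higher ranks, which indicates that the proposed binding-site representation and ranking method make binding-site prediction more efficient.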


    Identifying the binding sites of a transcription factor in the upstream sequences of the genes it regulates is the first step toward understanding the gene regulatory mechanism. A recent assessment of computational tools for identifying these binding sites indicates that the task remains challenging in higher organisms, such as humans. The task is hampered by the intrinsic subtlety of binding sites and by huge background noise: only a small portion of the genome is bound by transcription factors, and the sequence-specific recognition that governs binding is subtle.
    In this thesis, we propose an unsupervised pattern recognition method to handle such incomplete and unbalanced biological data. To model binding activity, we propose a vector of short sequence features. To identify candidate patterns for binding sites, the ranking takes into account both the overall over-representation and the sequence popularity of each pattern. To evaluate performance and to compare with related work, we adopt a benchmark that has been used to assess existing tools. The experimental results show that the proposed methodology outperforms the related work in nucleotide-level performance when only the top 3 nodes in the ranking are considered. The proposed representation of binding sites and the ranking mechanism make binding-site prediction efficient.
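
The abstract mentions sliding segmentation, short-sequence feature vectors, and a sequence popularity score without giving their definitions. The sketch below is only one plausible reading, not the thesis's actual implementation: it assumes the features are overlapping k-mer counts and that popularity is the fraction of input sequences contributing at least one segment to a cluster. The function names, the window width, and `k` are all hypothetical.

```python
from collections import Counter
from itertools import product


def sliding_segments(sequence, width):
    """Yield (start, segment) for every window of `width` bases (step of 1)."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]


def kmer_feature_vector(segment, k=2):
    """Count every possible k-mer over ACGT in a fixed lexicographic order.

    Assumption: plain overlapping k-mer counts stand in for the thesis's
    "short sequence features and their frequencies".
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    return [counts[kmer] for kmer in kmers]


def sequence_popularity(assignments, n_sequences):
    """Fraction of input sequences with at least one segment in each cluster.

    `assignments` is an iterable of (sequence_index, cluster_id) pairs,
    for example the output of a clustering step such as a self-organizing map.
    """
    seen = {}
    for seq_idx, cluster in assignments:
        seen.setdefault(cluster, set()).add(seq_idx)
    return {cluster: len(seqs) / n_sequences for cluster, seqs in seen.items()}


if __name__ == "__main__":
    upstream = "ACGTACGTTTGACA"  # toy sequence, not benchmark data
    for start, segment in sliding_segments(upstream, width=8):
        print(start, segment, kmer_feature_vector(segment, k=2))
```

Fixed-length count vectors like these can be fed directly to a clustering step, and the popularity fraction favors clusters whose segments recur across many upstream sequences rather than within a single one.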

    Abstract II
    Contents V
    List of Figures VII
    List of Tables IX
    Chapter 1 Introduction 1
    1.1 Motivation 1
    1.2 The Challenges of Current Research 3
    1.2.1 Intrinsic Subtlety 3
    1.2.2 Huge Noise Background 3
    1.3 Goals 4
    1.4 Outline of the Thesis 4
    Chapter 2 Background 6
    2.1 Transcription Factor Binding Site 6
    2.1.1 Regulation of Gene Expression 6
    2.1.2 Regulatory sequence 7
    2.1.3 Representation of binding sites 8
    2.2 Related work 10
    2.2.1 MEME [2] 10
    2.2.2 SOMBRERO [26] 11
    2.2.3 Limitations 12
    2.3 Tools 12
    2.3.1 Markov model 12
    2.3.2 Kohonen Self-Organizing Map [21] 13
    Chapter 3 TFBS Identifier 15
    3.1 Overview of Proposed Methodology 15
    3.2 Preprocessing 18
    3.2.1 Sliding Segmentation 19
    3.2.2 Feature Vector Coding 19
    3.3 Self-Organizing Map Clustering 22
    3.3.1 Structure of the Self-Organizing Map 23
    3.3.2 Training Phase 24
    3.3.3 Mapping Phase 26
    3.4 Ranking Mechanism 28
    3.4.1 Over-Representative Score 30
    3.4.2 Sequence Popularity Score 31
    3.5 Characteristics of the Proposed Methodology 32
    Chapter 4 Experimental Results 36
    4.1 Experimental Design 36
    4.1.1 Dataset Description 36
    4.1.2 Evaluation Criteria 41
    4.2 Performance 45
    4.2.1 Parameter Settings 45
    4.2.2 Nucleotide Level Performance Evaluation 47
    4.3 Comparison with Related Work 49
    Chapter 5 Discussions and Conclusions 55
    5.1 Discussions 55
    5.1.1 Advantages of the Proposed Methodology 55
    5.1.2 Limitations of the Proposed Methodology 56
    5.2 Conclusions 56
    5.3 Further Work 57
    References 59
    Vita 66

    [1] Wanyuan Ao, Jeb Gaudet, W. James Kent, Srikanth Muttumu, and S. E. Mango, “Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR,” Science, vol. 305, pp. 1743-6, 2004.
    [2] Timothy L. Bailey and C. Elkan, “Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization,” Machine Learning, vol. 21, pp. 51-83, 1995.
    [3] Panayiotis V. Benos, Martha L. Bulyk, and G. D. Stormo, “Additivity in protein-DNA interactions: how good an approximation is it?,” Nucleic Acids Res., vol. 30, pp. 4442-51, 2002.
    [4] Benjamin P. Berman, Yutaka Nibu, Barret D. Pfeiffer, Pavel Tomancak, Susan E. Celniker, Michael Levine, Gerald M. Rubin, and M. B. Eisen, “Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome,” Proc Natl Acad Sci, vol. 99, pp. 757-62, 2002.
    [5] M. D. Biggin, “To bind or not to bind,” Nature Genetics, vol. 28, pp. 303-304, 2001.
    [6] M. L. Bulyk, “Computational prediction of transcription-factor binding site locations,” Genome Biol., vol. 5, pp. 201, 2003.
    [7] Martha L. Bulyk, Philip L. F. Johnson, and G. M. Church, “Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors,” Nucleic Acids Res., vol. 30, pp. 1255-61, 2002.
    [8] Moises Burset and R. Guigo, “Evaluation of gene structure prediction programs,” Genomics, vol. 34, pp. 353-67, 1996.
    [9] A. Cornish-Bowden, “Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984,” Nucleic Acids Res., vol. 13, pp. 3021-3030, 1985.
    [10] Arthur L. Delcher, Douglas Harmon, Simon Kasif, Owen White, and S. L. Salzberg, “Improved microbial gene identification with GLIMMER,” Nucleic Acids Res., vol. 27, pp. 4636-41, 1999.
    [11] Eleazar Eskin and P. A. Pevzner, “Finding composite regulatory patterns in DNA sequences,” Bioinformatics, vol. 18 Suppl 1, pp. S354-63, 2002.
    [12] A. V. Favorov, M. S. Gelfand, A. V. Gerasimova, D. A. Ravcheev, A. A. Mironov, and V. J. Makeev, “A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length,” Bioinformatics, vol. 21, pp. 2240-5, 2005.
    [13] Diane E. Frank, Ruth M. Saecker, Jeffrey P. Bond, Michael W. Capp, Oleg V. Tsodikov, Sonya E. Melcher, Mark M. Levandoski, and J. M. Thomas Record, “Thermodynamics of the interactions of lac repressor with variants of the symmetric lac operator: effects of converting a consensus site to a non-specific site,” J Mol Biol., vol. 267, pp. 1186-206, 1997.
    [14] Martin C. Frith, Ulla Hansen, John L. Spouge, and Z. Weng, “Finding functional sequence elements by multiple local alignment,” Nucleic Acids Res., vol. 32, pp. 189-200, 2004.
    [15] K. Grzeskowiak, “Sequence-dependent structural variation in B-DNA,” Chem Biol, vol. 3, pp. 785-90, 1996.
    [16] J. van Helden, B. Andre, and J. Collado-Vides, “Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies,” J Mol Biol., vol. 281, pp. 827-42, 1998.
    [17] Jacques van Helden, Alma F. Rios, and J. Collado-Vides, “Discovering regulatory elements in non-coding sequences by analysis of spaced dyads,” Nucleic Acids Res., vol. 28, pp. 1808-18, 2000.
    [18] Gerald Z. Hertz and G. D. Stormo, “Identifying DNA and protein patterns with statistically significant alignments of multiple sequences,” Bioinformatics, vol. 15, pp. 563-77, 1999.
    [19] Jason D. Hughes, Preston W. Estep, Saeed Tavazoie, and G. M. Church, “Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae,” J Mol Biol., vol. 296, pp. 1205-14, 2000.
    [20] Ronald Jansen and M. Gerstein, “Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction,” Curr Opin Microbiol., vol. 7, pp. 535-545, 2004.
    [21] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 1995.
    [22] A. Krogh, “Two methods for improving performance of an HMM and their application for gene finding,” Proc Int Conf Intell Syst Mol Biol., vol. 5, pp. 179-86, 1997.
    [23] M. L. Lee, M. L. Bulyk, G. A. Whitmore, and G. M. Church, “A statistical model for investigating binding probabilities of DNA nucleotide sequences using microarrays,” Biometrics, vol. 58, pp. 981-8, 2002.
    [24] Nan Li and M. Tompa, “Analysis of computational approaches for motif discovery,” Algorithms Mol Biol., vol. 1, pp. 8, 2006.
    [25] Alexander V. Lukashin and M. Borodovsky, “GeneMark.hmm: new solutions for gene finding,” Nucleic Acids Res., vol. 26, pp. 1107-15, 1998.
    [26] Shaun Mahony, David Hendrix, Aaron Golden, Terry J. Smith, and D. S. Rokhsar, “Transcription factor binding site identification using the self-organizing map,” Bioinformatics, vol. 21, pp. 1807-14, 2005.
    [27] Shaun Mahony, James O McInerney, Terry J Smith, and A. Golden, “Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models,” BMC Bioinformatics, vol. 5, pp. 23, 2004.
    [28] Tsz-Kwong Man and G. D. Stormo, “Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay,” Nucleic Acids Res., vol. 29, pp. 2471-8, 2001.
    [29] Daniel Meierhans, Martin Sieber, and R. K. Allemann, “High affinity binding of MEF-2C correlates with DNA bending,” Nucleic Acids Res., vol. 25, pp. 4537-44, 1997.
    [30] Leelavati Narlikar and A. J. Hartemink, “Sequence features of DNA binding sites reveal structural class of associated transcription factor,” Bioinformatics, vol. 22, pp. 157-63, 2006.
    [31] William Stafford Noble, Scott Kuehn, Robert Thurman, Man Yu, and J. Stamatoyannopoulos, “Predicting the in vivo signature of human gene regulatory sequences,” Bioinformatics, vol. 21 Suppl 1, pp. i338-43, 2005.
    [32] Lino Ometto, Wolfgang Stephan, and D. D. Lorenzo, “Insertion/Deletion and Nucleotide Polymorphism Data Reveal Constraints in Drosophila melanogaster Introns and Intergenic Regions,” Genetics, vol. 169, pp. 1521-7, 2005.
    [33] A. Papavassiliou, Transcription factors in eukaryotes: Landes Bioscience, 1997.
    [34] Giulio Pavesi, Giancarlo Mauri, and G. Pesole, “An algorithm for finding signals of unknown length in DNA sequences,” Bioinformatics, vol. 17 Suppl 1, pp. S207-14, 2001.
    [35] Giulio Pavesi, Paolo Mereghetti, Giancarlo Mauri, and G. Pesole, “Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Research, vol. 32, pp. W199-W203, 2004.
    [36] P. A. Pevzner and S. H. Sze, “Combinatorial approaches to finding subtle signals in DNA sequences,” Proc Int Conf Intell Syst Mol Biol., vol. 8, pp. 269-278, 2000.
    [37] Julia V. Ponomarenko, Mikhail P. Ponomarenko, Anatoly S. Frolov, Denis G. Vorobyev, G. Christian Overton, and N. A. Kolchanov, “Conformational and physicochemical DNA features specific for transcription factor binding sites,” Bioinformatics, vol. 15, pp. 654-68, 1999.
    [38] Mireille Regnier and A. Denise, “Rare Events and Conditional Events on Random Strings,” Math. Theor. Comput. Sci., vol. 6, pp. 191-214, 2004.
    [39] Peter J. Sabo, Richard Humbert, Michael Hawrylycz, James C. Wallace, Michael O. Dorschner, Michael McArthur, and J. A. Stamatoyannopoulos, “Genome-wide identification of DNaseI hypersensitive sites using active chromatin sequence libraries,” Proc Natl Acad Sci, vol. 101, pp. 4537-42, 2004.
    [40] Saurabh Sinha and M. Tompa, “YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation,” Nucleic Acids Res., vol. 31, pp. 3586-8, 2003.
    [41] Barry D. Starr, Barbara C. Hoopes and Diane K. Hawley, “DNA bending is an important component of site-specific recognition by the TATA binding protein,” J Mol Biol., vol. 250, pp. 434-46, 1995.
    [42] M. Stepanova, T. Tiazhelova, M. Skoblov, and A. Baranova, “A comparative analysis of relative occurrence of transcription factor binding sites in vertebrate genomes and gene promoter areas,” Bioinformatics, vol. 21, pp. 1789-96, 2005.
    [43] Gary D. Stormo, Thomas D. Schneider, and L. M. Gold, “Characterization of translational initiation sites in E. coli,” Nucleic Acids Res., vol. 10, pp. 2971-96, 1982.
    [44] G. D. Stormo, “DNA binding sites: representation and discovery,” Bioinformatics, vol. 16, pp. 16-23, 2000.
    [45] Masashi Suzuki, Naoki Amano, Jun Kakinuma, and M. Tateno, “Use of a 3D structure data base for understanding sequence-dependent conformational aspects of DNA,” J Mol Biol., vol. 274, pp. 421-35, 1997.
    [46] Gert Thijs, Magali Lescot, Kathleen Marchal, Stephane Rombauts, Bart De Moor, Pierre Rouze, and Y. Moreau, “A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling,” Bioinformatics, vol. 17, pp. 1113-1122, 2001.
    [47] Martin Tompa, Nan Li, Timothy L Bailey, George M Church, Bart De Moor, Eleazar Eskin, Alexander V Favorov, Martin C Frith, Yutao Fu, W James Kent, Vsevolod J Makeev, Andrei A Mironov, William Stafford Noble, Giulio Pavesi, Graziano Pesole, Mireille Régnier, Nicolas Simonis, Saurabh Sinha, Gert Thijs, Jacques van Helden, Mathias Vandenbogaert, Zhiping Weng, Christopher Workman, Chun Ye, and Z. Zhu, “Assessing computational tools for the discovery of transcription factor binding sites,” Nat Biotechnol., vol. 23, pp. 137-44, 2005.
    [48] Huai-Chun Wang, Jonathan Badger, Paul Kearney, and M. Li, “Analysis of codon usage patterns of bacterial genomes using the self-organizing map,” Mol Biol Evol., vol. 18, pp. 792-800, 2001.
    [49] James D. Watson, Mark Zoller, Michael Gilman, and J. Witkowski, Recombinant DNA, 2nd ed: W. H. Freeman, 1992.
    [50] E. Wingender, P. Dietze, H. Karas, and R. Knuppel, “TRANSFAC: a database on transcription factors and their DNA binding sites,” Nucleic Acids Res., vol. 24, pp. 238-41, 1996.
    [51] C. T. Workman and G. D. Stormo, “ANN-Spec: a method for discovering transcription factor binding sites with improved specificity,” presented at Pac Symp Biocomput., 2000.
    [52] Jian Zhu and M. Q. Zhang, “SCPD: a promoter database of the yeast Saccharomyces cerevisiae,” Bioinformatics, vol. 15, pp. 607-611, 1999.
