簡易檢索 / 詳目顯示

研究生: 李昭明
Chao - Ming Lee
論文名稱: A Graph-Based Approximate String Matching Method for Predicting Transcription Factor Binding Sites
以圖學理論為基礎的近似字串匹配來預測轉錄因子結合位
指導教授: 李漢銘
Hahn-Ming Lee
黃明經
Ming-Jing Hwang
口試委員: 李育杰
Yuh-Jye Lee
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2007
畢業學年度: 95
語文別: 英文
論文頁數: 109
中文關鍵詞: 轉錄因子近似匹配計分函數
外文關鍵詞: motif, transcription factor binding sites, approximate string match, suffix array, scoring function.
相關次數: 點閱:262下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 我們提出了一個新的尋找轉錄因子結合位的方法 採用suffix array 及 approximate string match來尋找implanted motif 及real motif.這兩個問題均能有良好的結果.本論文提出的架構顯示在hard challenge problem 中,motif 能很容易的被找到.同時所提出的兩個方程式Kullback Leibler 及Euler equation,其中Kullback Leibler是比較好的scoring function.


    In this paper we proposed a new computational scheme using suffix array and approximate string matching techniques to assess the capability of predicting implanted motifs and real motifs of transcription factor binding sites, a challenging problem of bioinformatics. We showed that several well-known “hard” problems of implanted motif can be solved efficiently with the proposed scheme, which obtained better results than previously reported algorithms.
    We also tested our scheme for the real motif finding problem by evaluating the predicting power of two scoring functions incorporating Euler and Kullback Leibler distance (KL) equations, respectively. Experiments showed that KL achieved a better performance than the Euler equation. However, we achieved only comparable performance in the competition with other motif finding tools using an on-line benchmark assessment service. Nevertheless, we learned from this thesis that the KL equation is a good scoring function for real motif prediction and good scoring functions play a key role in predicting transcription factor binding sites.
    Keyword: motif, transcription factor binding sites, approximate string match, suffix array, scoring function.

    Content Abstract 3 Acknowledgements 4 Content ………………………………………………………………….5 Chapter 1 Introduction 10 1.1 Background 10 1.2 Motivation 12 1.3 Motif Finding Problems 12 1.3.1 The Planted (l,d)-Variant Motif Problem 13 1.3.2 Real Motif 15 1.4 Outline of this Thesis 15 Chapter 2 Related Work 16 2.1 Motif Representation Models 16 2.1.1 Profile and Consensus Motif Models 16 2.1.2 Profile Motif Discovery 17 2.1.3 Consensus Motif Discovery 18 2.1.4 Background Models of Random Sequences 19 2.1.5 The Probability of Candidate Patterns and Their Significance 21 2.1.6 Statistical Overrepresentation with Scoring Function 21 2.1.7 The Distribution of Motifs 22 2.2 Motif Finding Algorithms 23 2.2.1 Candidate Motifs Extraction 26 2.2.2 Potential Motifs Scoring 27 2.2.3 Steps of Motif Finding Algorithms 28 2.3.1 Consensus 32 2.3.2 GibbsDNA 33 2.3.3 MEME-EM Algorithm: Probabilistic Approaches 33 2.3.4 ANN-Spec 34 2.3.5 AlignACE 34 2.3.6 GLAM 35 2.3.7 The Improbizer 35 2.3.8 MITRA: Mismatch (Prefix) Tree 35 2.3.9 MotifSampler: Probabilistic approaches 36 2.3.10 Oligo/dyad-analysis 36 2.3.11 QuickScore 36 2.3.12 SeSiMCMC: “Another Gibbs Sampler digging for DNA-motifs” 37 2.3.13 WEEDER: Tree-Based Algorithm (Suffix Tree) 37 2.3.14 YMF 43 2.4 Assessment of Computational Motif Discovery Tools 43 2.5 Algorithms for the Challenge Problem 45 Chapter 3 System Architecture 48 3.1 The Relative Distance between Motif-Background and Non-motif-Background Regions 48 3.2 System Architecture of Planted Motif Finder (PMF) 50 3.3 System Architecture of Biological Motif Finder (BMF) 62 3.3.1 Score Method of the Proposed Algorithm (a three phases’ scoring) 64 3.3.2 Architectures of the Proposed Method , Weeder and Styczynski et al’s Algorithm 65 3.4 Features of the Proposed Algorithm 68 Chapter 4 Experimental Results 72 4.1 The Difference between Motif-Background and Motif-Non-motif regions 72 4.2 Results of Implanted Motifs 76 4.3 Results of biological motifs 79 Chapter 5 Discussion 85 5.1 The Difference of Scoring Functions 85 5.2 The Advantages of Proposed IMF algorithm 86 5.3 The Advantages of Proposed BMF algorithm 87 5.4 SUMMARY 88 Chapter 6 Conclusion 90 6.1 Conclusion 90 6.2 Further Work 91 Reference 92 Appendix A(the process of monad problem): 102 Appendix B The Clique Algorithm[16] 105

    [1] Abouelhoda MI, Kurtz S, Ohlebusch E (2002). “The enhanced suffix array and its applications to genome analysis”. Proceedings of the Second Workshop on Algorithms in Bioinformatics, Springer-Verlag. LNCS 2452, pp. 449–463
    [2] Akutsu T, Arimura H, Shimozono S (2000). “On approximation algorithms for local multiple alignment”. In RECOMB, Tokyo, Japan, page:1-7
    [3] Apostolico A (1985). “The myriad virtues of subword trees”. In Combinatorial Algorithms on Words. A. Apostolico and Z. Galil, Eds. NATO ASI Series F: Computer and System Sciences, Springer- Verlag, New York, pp. 85-96
    [4] Apostolico A, Gong FC., and Lonardi S (2003). "Verbumculus and the Discovery of Unusual Words". Journal of Computer Science and Technology 19(1): 22–41
    [5] Ao W, Gaudet J, Kent WJ, Muttumu S, Mango SE. (2004). “Environmentally induced foregut remodeling by pha-4/foxa and daf-12/nhr”. Science, 305 (5691):1743–6
    [6] Bailey TL, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers”. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, California, AAAI Press,2:page:28-36
    [7] Blanchette M, Sinha S.(2001). “Separating real motifs from their artifacts”. Bioinformatics.17 Suppl 1:S30-8
    [8] Brazma A, Jonassen I, Vilo J, Ukkonen E(2002). “Pattern Disocovery from Biosequences”. Master Thesis, University of Helsinki
    [9] Buhler J, Tompa M.(2001). “Finding motifs using random Projections”. Proceedings of the fifth annual international conference on Computational biology Montreal, Quebec, Canada, Pages: 69 - 76
    [10] Burge C, Campbell AM, Karlin S (1992). “Over- and under-representation of short oligonucleotides in DNA sequences”. Proc Natl Acad Sci U S A,15;89(4):1358-1362
    [11] Bussemaker HJ, Li H, Siggia ED (2000). “Regulatory element detection using a probabilistic segmentation model”. Proc Int Conf Intell Syst Mol Biol, 8:67-74
    [12] Choi J(2004). “A Critical Review of Computational Approaches on Finding Transcriptional Factor Binding Sites”. BIOCHEM218 Final Project Submitted for Credit in Computational Molecular Biology
    [13] Chen RM, Liu FM, Tsai JP and Shih SH (2004). “Fmga: finding motifs by genetic algorithm”. In 4th IEEE International Symposium on BioInformatics and BioEngineering, Taichung, Taiwan, 459- 466
    [14] Coessens B, Thijs G, Aerts S, Marchal K, De Smet F, Engelen K, Glenisson P, Moreau Y, Mathys J, and De Moor B (2003). "INCLUSive: A web portal and service registry for microarray and regulatory sequence analysis". NucleicAcids Res 31(13): 3468–70
    [15] Dermouche A (1995). "A fast algorithm for string matching with mismatches". Information Processing Letters 55(2): 105–110
    [16] Dharwadker A (2006). "The Clique Algorithm". From http://www.geocities.
    com/ dharwadker/clique/
    [17] Eskin E (2004). “From profiles to patterns and back again: a branch and bound. algorithm for finding near optimal motif profiles”. In Proceedings of the eighth annual international conference on Computational molecular biology, ACM Press
    [18] Eskin E, Pevzner PA (2002). "Finding composite regulatory patterns in DNA sequences". Bioinformatics 18(1): S354–63
    [19] F.N. Abu-Khzam, N.E. Baldwin, M.A. Langston, and N.F. Samatova(2005). “On the Relative Efficiency of Maximal Clique Enumeration Algorithms, with Applications to High-Throughput Computational Biology”. Proceedings, International Conference on Research Trends in Science and Technology, Beirut, Lebanon
    [20] Favorov,AV., Gelfand,MS., Mironov,AA, and Makeev, VJ.(2002). “Yet another digging for DNA motifs Gibbs sampler”. Proceedings of BGRS 2002, Novisibisk, 1, 31-33
    [21] Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, and Makeev VJ(2005). “Gibbs sampler for identification of symmetrically structured, spaced dna motifs with improved estimation of the signal length”. Bioinformatics, 21(10):2240–2245
    [22] Fogel GB, Porto VW, Weekes DG, Fogel DB,Griffey RH,McNeil JA,Lesnik E, Ecker DJ, Sampath R(2002). "Discovery of RNA structural elements using evolutionary computation". Nucleic Acids Res 30(23): 5310-5317
    [23] Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB(2004). "Discovery of sequence motifs related to coexpression of genes using evolutionary computation". Nucleic Acids Res 32(13): 3826–35
    [24] Frith MC, Hansen U, Spouge JL,Weng Z(2004). "Finding functional sequence elements by multiple local alignment". Nucleic Acids Res 32(1): 189–200
    [25] Galil Z, Giancarlo R(1986). "Improved string matching with k mismatches". Sigact News 17(4): 52–54
    [26] Ganesh R, Siegele DA, Ioerger TR(2003). "MOPAC: motif finding by preprocessing and agglomerative clustering from microarrays". Pac Symp Biocomput: 41–52
    [27] Grossi R, Luccio F(1989). "Simple and efficient string matching with k mismatches". Information Processing Letters 33(3): 113–120
    [28] Grundy WN, Bailey TL, Elkan CP, Baker ME(1997). “Meta-MEME: motif-based hidden Markov models of protein families”. Comput Appl Biosci. Aug;13(4):397-406
    [29] GuhaThakurta D, Palomar L, Stormo GD, Tedesco P, Johnson TE, Walker DW, Lithgow G, Kim S, Link CD(2002). "Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods". Genome Res 12(5): 701–12
    [30] Gusfield D(1997). "Rare Events and Conditional Events on Random Strings". Discrete Mathematics and Theoretical Computer Science 6: 191–214
    [31] Häußler M, Nicolas J(2005). “Motif Discovery on Promotor Sequences”. Universi at Potsdam
    [32] Henry CM Leung and Francis YL Chin(2005). “An Efficient Algorithm for the Extended (l,d)-Motif Problem with Unknown Number of Binding Sites”. BIBE Proceedings of the Fifth IEEE Symposium on Bioinformatics and Bioengineering, Pages: 11 – 18
    [33] Hernandez D, Gras R, Appel R(2004). "MoDEL: an efficient strategy for ungapped local multiple alignment". Comput Biol Chem 28(2): 119–28
    [34] Hertz GZ, Stormo GD(1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics 15(7-8): 563–77
    [35] Hertz GZ, Stormo GD(1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics 15: 563 77
    [36] Hu YJ.(2003). "Finding subtle motifs with variable gaps in unaligned dna Sequences". Comput Methods Programs Biomed 70(1): 11–20
    [37] Hughes JD, Estep PW, Tavazoie S, Church GM(2000). "Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae". J Mol Biol 296(5): 1205–14
    [38] Jacques VH, Alma. FR and Julio CV(2000). "Discovering regulatory elements in non-coding sequences by analysis of spaced dyads". NucleicAcids Res 28(8): 1808–18
    [39] Jensen K., Styczynski M., Rigoutsos I, & Stephanopoulos G.(2005). "A generic motif discovery algorithm for sequential data". Bioinformatics, 22(1):21-28
    [40] Keich U and Pevzner PA(2002). "Subtle motifs: defining the limits of motif finding algorithms". Bioinformatics 18(10): 1382–90
    [41] Keich U and Pevzner PA(2002). "Finding motifs in the twilight zone". Bioinformatics 18(10): 1374-1381
    [42] Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC (1989). "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment". Science 262 (5131): 208–14
    [43] Liang S(2004). "cWINNOWER algorithm for finding fuzzy dna motifs". J Bioinform Comput Biol 2(1): 47–60
    [44]Landau GM, Vishkin U(1986). “Efficient Parallel And Serial Approximate String Matching”. Theoretical Computer Science, Volume 43 , Issue 2-3 ,Pages: 239 -249
    [45] Landau G.M. and Vishkin U(1986). "Efficient string matching with k mismatches". Theoretical Computer Science 43(2-3): 239–249
    [46] Liu JS(1994). "The collapsed gibbs sampler in bayesian computations with applications to a gene regulation problem". Journal of the American Statistical Association 29(427): 958–967
    [47] Liu X, Brutlag DL(2001). "BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes". Pac Symp Biocomput 7: 127–38
    [48]MacIsaac KD, Fraenkel E(2006). “Practical Strategies for Discovering Regulatory DNA Sequence Motifs”. PLoS Comput Biol; 2(4): e36
    [49] Mahony S, Hendrix D, Golden A, Smith TJ, Rokhsar DS(2005). "Transcription factor binding site identification using the Self-Organizing Map". Bioinformatics 21(9): 1807-14
    [50] Manber U, Myers E(1990). "Suffix Arrays: A new method for on-line string searches". First ACM-SIAM Symposium on Discrete Algorithms: 319-327
    [51] Marsan L, Sagot MF(2001). “Algorithms for extracting structured motifs using a suffix-tree with application to promoter and regulatory site consensus identification”. J. of Computational Biology, 7:345-360
    [52] Marinescu VD; Kohane IS; Riva A(2005). "MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes". BMC Bioinformatics 6:79
    [53] Michailidis PD. and Margaritis KG(2002). "On-Line Approximate String Searching Algorithms:Survey and Experimental Results". Intern. J. Computer Math 79(8): 867–888
    [54] Navarro G(1997). “A partial deterministic automaton for approximate string matching”. In Proc. of the 4th South American Workshop on String Processing Carleton University Press, Pages 112-124
    [55] Navarro G, Raffinot M(1998). “A bit-parallel approach to suffix automata: Fast extended string matching”. In Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin
    [56] Navarro G(2001). "NR-grep: a fast and flexible pattern matching tool". Software Practice and Experience 31: 1265-1312
    [57] Navarro G(2001). ”A Guided Tour to Approximate String Matching”. CM Computing Surveys (CSUR), Volume 33, Issue 1, pages: 31 - 88
    [58] Pavesi G; Mauri G; Pesole G(2001). "An algorithm for finding signals of unknown length in DNA sequences". Bioinformatics 17(1): S207–14. 28, 33
    [59] Pavesi G, Mereghetti P, Mauri G, Pesole G(2004). "Weeder Web: discovery of transcription factor binding sites”. Nucleic Acids Research 2(Web Server Issue): W199-W203
    [60] Pevzner PA, Waterman MS( 1995). ”Multiple filtration and approximate pattern matching”. Author Affiliation: Univ of Southern California Source: Algorithmica (New York) 13 1-2 Jan-Feb Springer-Verlag New York Inc p 135-154
    [61] Pevzner PA, Sze SH(2000). “Combinatorial approaches to finding subtle signals in DNA sequences”. Proc Int Conf Intell Syst Mol Biol, 8:269-78
    [62] Pudimat R; Schukat-Talamazzini EG; Backofen R(2005). "A multiple-feature framework for modelling and predicting transcription factor binding sites". Bioinformatics 21(14): 3082–3088
    [63] Ricardo AB, Gaston HG(1994). "Fast string matching with k mismatches". Information and Computation 108(2): 187–199
    [64] Ricardo AB, Chris HP(1996). "Fast and practical approximate string matching". Information Processing Letters 59(1): 21–27
    [65] Ricardo AB, Gaston HG(1992). "A new approach to text searching". Communications of the ACM 35(10): 74–82
    [66] Rigoutsos I and Floratos A(1998). "Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm". Bioinformatics 14(1): 55–67
    [67] Schbath S(1997). "An efficient statistic to detect over- and underrepresented words in DNA sequences". J Comput Biol 4(2): 189–92
    [68] Sinha S(2003). "Discriminative motifs". J Comput Biol 10(3-4): 599–615
    [69] Sinha S, Tompa M(2000). “An Statistical Method for Finding Transcription Factor Binding Sites”. In Proceedings of the Eighth International Conference on Interlligent Systems for Molecular Biology, 8:344~354
    [70] Sinha, S. & Tompa, M(2000). “YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation”. Nucleic Acids Research, 2003, Vol. 31, No. 13, 3586-3588
    [71] Sinha, S. and Tompa, M(2002). “Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation”. Nucleic Acids Research, vol. 30, no. 24, December, 5549-5560
    [72] Shinozaki D, Akutsu T, Maruyama O(2003). “Finding optimal degenerate patterns in DNA sequences”. Bioinformatics, 19, Suppl 2:II206-II214
    [73] Stefania B, Alessandro C, Andrea B, Cinzia P and Gian AD(2005). "A multistep bioinformatic approach detects putative regulatory elements in gene promoters". BMC Bioinformatics, 6: 121
    [74] Schneider T, Stephens M.(1990). “Sequence Logos: A New Way to Display Consensus Sequences”. Nucleic Acids Res, 18:6097-6100
    [75] Stormo GD and Hartzell GW(1989). " Identifying DNA and protein patterns with statistically significant alignments of multiple sequences ". Proc Natl Acad Sci U S A 86(4): 1183–7
    [76] Stormo GD(2000), “DNA binding sites: representation and discovery”. Bioinformatics, Jan;16(1):16-23
    [77] Stormo G. & Hartzell G(1989). ”Identifying protein-binding sites from unaligned
    DNA fragments”. Proc Natl Acad Sci U S A, 86 (4), 1183–7
    [78] Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos GN(2004). "An extension and novel solution to the (l,d)-motif challenge problem". Genome Informatics 15: 63-71
    [79] Sutou T, Tamura K, Mori Y and Kitakami H(2003). “Design and Implementation of Parallel Modified PrefixSpan Method”. Lecture Notes in Computer Science, Volume 2858, Springer Berlin, Heidelberg
    [80] Tarhio J, Ukkonen E(1993). "Approximate boyer-moore string matching". SIAM Journal on Computing 22(2): 243–260
    [81] Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé Pand Moreau Y (2001). "A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling". Bioinformatics 17(12): 1113-1122
    [82] Thompson W, Rouchka EC, Lawrence CE(2003). "Gibbs Recursive Sampler: finding transcription factor binding sites". Nucleic Acids Research 31(13): 3580-3585
    [83] Tompa M., Li N, Bailey TL. , Church G M. , De Moor B, Eskin E, Favorov A. V, Frith M. C, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, N, Sinha S, Thijs G, Van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, and Zhu Z(2005). "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites, Nature Biotechnology". 23(1): 137 - 144
    [84] Tomita E and Seki T(1989). “An Optimal Algorithm for finding all the cliques”. SIG Algorithms, 12, 91–98
    [85] Van Helden J, Andre B, Collado-Vides J(1998). "Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies". J Mol Biol 281(5): 827–42
    [86] VijayaSatya R. Mukheqee A(2004). “Pruner: algorithms for finding monad patterns in dna sequences”. In 3rd International IEEE Computer Society Computational Systems Bioinformatics Conference, Stanford, CA, USA IEEE Computer Society, pp. 662-665
    [87] Wang RL, Tang Z, Cao QP(2003). “An efficient approximation algorithm for finding a maximum clique using Hopfield network learning”. Neural Comput,15(7):1605-19
    [88] Wingender E , Dietze P , Karas H and Knüppel R(1996). "TRANSFAC: a database on transcription factors and their DNA binding sites". Nucleic Acids Res 24(1): 238-41
    [89] Workman CT, Stormo GD(2000). ”ANN-Spec: a method for discovering transcription factor binding sites with improved specificity”. Pac Symp Biocomput; 467 -78
    [90] Wu S, Manber U(1991). “AGREP - A Fast Approximate Pattern-matching Tool”. Proceedings of the Winter USENIX Conference San Francisco, USA, Berkeley, USA, 153-162
    [91] Yada T, Totoki Y, Ishikawa M, Asai K and Nakai K(1998). "Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences". Bioinformatics 14(4): 317–325
    [92] Zhou Q, Liu JS(2004). "Modeling within-motif dependence for transcription factor binding site predictions". Bioinformatics 20(6): 909–16
    [93] Zhu J, Zhang MQ(1999). "SCPD: a promoter database of the yeast Saccharomyces cerevisiae". Bioinformatics 15(7-8): 607-11

    QR CODE