A Graph-Based Approximate String Matching Method for Predicting Transcription Factor Binding Sites

簡易檢索 / 詳目顯示

回結果列表

研究生：	李昭明 Chao - Ming Lee
論文名稱：	A Graph-Based Approximate String Matching Method for Predicting Transcription Factor Binding Sites 以圖學理論為基礎的近似字串匹配來預測轉錄因子結合位
指導教授：	李漢銘 Hahn-Ming Lee 黃明經 Ming-Jing Hwang
口試委員:	李育杰 Yuh-Jye Lee
學位類別：	碩士 Master
系所名稱：	電資學院 - 資訊工程系 Department of Computer Science and Information Engineering
論文出版年：	2007
畢業學年度：	95
語文別：	英文
論文頁數：	109
中文關鍵詞：	轉錄因子、近似匹配、計分函數
外文關鍵詞：	motif, transcription factor binding sites, approximate string match, suffix array, scoring function.
相關次數：	點閱：262 下載：2
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

我們提出了一個新的尋找轉錄因子結合位的方法　採用suffix array 及　approximate string match來尋找implanted motif 及real motif.這兩個問題均能有良好的結果.本論文提出的架構顯示在hard challenge problem 中,motif 能很容易的被找到.同時所提出的兩個方程式Kullback Leibler 及Euler equation,其中Kullback Leibler是比較好的scoring function.

In this paper we proposed a new computational scheme using suffix array and approximate string matching techniques to assess the capability of predicting implanted motifs and real motifs of transcription factor binding sites, a challenging problem of bioinformatics. We showed that several well-known “hard” problems of implanted motif can be solved efficiently with the proposed scheme, which obtained better results than previously reported algorithms.
We also tested our scheme for the real motif finding problem by evaluating the predicting power of two scoring functions incorporating Euler and Kullback Leibler distance (KL) equations, respectively. Experiments showed that KL achieved a better performance than the Euler equation. However, we achieved only comparable performance in the competition with other motif finding tools using an on-line benchmark assessment service. Nevertheless, we learned from this thesis that the KL equation is a good scoring function for real motif prediction and good scoring functions play a key role in predicting transcription factor binding sites.
Keyword: motif, transcription factor binding sites, approximate string match, suffix array, scoring function.

Content
Abstract	3
Acknowledgements	4
Content	………………………………………………………………….5
Chapter 1 Introduction	10
1 Background	10
2 Motivation	12
3 Motif Finding Problems	12
3.1 The Planted (l,d)-Variant Motif Problem	13
3.2 Real Motif	15
4 Outline of this Thesis	15
Chapter 2 Related Work	16
1 Motif Representation Models	16
1.1 Profile and Consensus Motif Models	16
1.2 Profile Motif Discovery	17
1.3 Consensus Motif Discovery	18
1.4 Background Models of Random Sequences	19
1.5 The Probability of Candidate Patterns and Their Significance	21
1.6 Statistical Overrepresentation with Scoring Function	21
1.7 The Distribution of Motifs	22
2 Motif Finding Algorithms	23
2.1 Candidate Motifs Extraction	26
2.2 Potential Motifs Scoring	27
2.3 Steps of Motif Finding Algorithms	28
3.1 Consensus	32
3.2 GibbsDNA	33
3.3 MEME-EM Algorithm: Probabilistic Approaches	33
3.4 ANN-Spec	34
3.5 AlignACE	34
3.6 GLAM	35
3.7 The Improbizer	35
3.8 MITRA: Mismatch (Prefix) Tree	35
3.9 MotifSampler: Probabilistic approaches	36
3.10 Oligo/dyad-analysis	36
3.11 QuickScore	36
3.12 SeSiMCMC: “Another Gibbs Sampler digging for DNA-motifs”	37
3.13 WEEDER: Tree-Based Algorithm (Suffix Tree)	37
3.14 YMF	43
4 Assessment of Computational Motif Discovery Tools	43
5 Algorithms for the Challenge Problem	45
Chapter 3 System Architecture	48
1 The Relative Distance between Motif-Background and Non-motif-Background Regions	48
2 System Architecture of Planted Motif Finder (PMF)	50
3 System Architecture of Biological Motif Finder (BMF)	62
3.1 Score Method of the Proposed Algorithm (a three phases’ scoring)	64
3.2 Architectures of the Proposed Method , Weeder and Styczynski et al’s Algorithm	65
4 Features of the Proposed Algorithm	68
Chapter 4 Experimental Results	72
1 The Difference between Motif-Background and Motif-Non-motif regions	72
2 Results of Implanted Motifs	76
3 Results of biological motifs	79
Chapter 5 Discussion	85
1 The Difference of Scoring Functions	85
2 The Advantages of Proposed IMF algorithm	86
3 The Advantages of Proposed BMF algorithm	87
4 SUMMARY	88
Chapter 6 Conclusion	90
1 Conclusion	90
2 Further Work	91
Reference	92
Appendix A(the process of monad problem):	102
Appendix B The Clique Algorithm[16]	105

                                

[1] Abouelhoda MI, Kurtz S, Ohlebusch E (2002). “The enhanced suffix array and its applications to genome analysis”. Proceedings of the Second Workshop on Algorithms in Bioinformatics, Springer-Verlag. LNCS 2452, pp. 449–463
[2] Akutsu T, Arimura H, Shimozono S (2000). “On approximation algorithms for local multiple alignment”. In RECOMB, Tokyo, Japan, page:1-7
[3] Apostolico A (1985). “The myriad virtues of subword trees”. In Combinatorial Algorithms on Words. A. Apostolico and Z. Galil, Eds. NATO ASI Series F: Computer and System Sciences, Springer- Verlag, New York, pp. 85-96
[4] Apostolico A, Gong FC., and Lonardi S (2003). "Verbumculus and the Discovery of Unusual Words". Journal of Computer Science and Technology 19(1): 22–41
[5] Ao W, Gaudet J, Kent WJ, Muttumu S, Mango SE. (2004). “Environmentally induced foregut remodeling by pha-4/foxa and daf-12/nhr”. Science, 305 (5691):1743–6
[6] Bailey TL, Elkan C (1994). “Fitting a mixture model by expectation maximization to discover motifs in biopolymers”. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, California, AAAI Press,2:page:28-36
[7] Blanchette M, Sinha S.(2001). “Separating real motifs from their artifacts”. Bioinformatics.17 Suppl 1:S30-8
[8] Brazma A, Jonassen I, Vilo J, Ukkonen E(2002). “Pattern Disocovery from Biosequences”. Master Thesis, University of Helsinki
[9] Buhler J, Tompa M.(2001). “Finding motifs using random Projections”. Proceedings of the fifth annual international conference on Computational biology Montreal, Quebec, Canada, Pages: 69 - 76
[10] Burge C, Campbell AM, Karlin S (1992). “Over- and under-representation of short oligonucleotides in DNA sequences”. Proc Natl Acad Sci U S A,15;89(4):1358-1362
[11] Bussemaker HJ, Li H, Siggia ED (2000). “Regulatory element detection using a probabilistic segmentation model”. Proc Int Conf Intell Syst Mol Biol, 8:67-74
[12] Choi J(2004). “A Critical Review of Computational Approaches on Finding Transcriptional Factor Binding Sites”. BIOCHEM218 Final Project Submitted for Credit in Computational Molecular Biology
[13] Chen RM, Liu FM, Tsai JP and Shih SH (2004). “Fmga: finding motifs by genetic algorithm”. In 4th IEEE International Symposium on BioInformatics and BioEngineering, Taichung, Taiwan, 459- 466
[14] Coessens B, Thijs G, Aerts S, Marchal K, De Smet F, Engelen K, Glenisson P, Moreau Y, Mathys J, and De Moor B (2003). "INCLUSive: A web portal and service registry for microarray and regulatory sequence analysis". NucleicAcids Res 31(13): 3468–70
[15] Dermouche A (1995). "A fast algorithm for string matching with mismatches". Information Processing Letters 55(2): 105–110
[16] Dharwadker A (2006). "The Clique Algorithm". From http://www.geocities.
com/ dharwadker/clique/
[17] Eskin E (2004). “From profiles to patterns and back again: a branch and bound. algorithm for finding near optimal motif profiles”. In Proceedings of the eighth annual international conference on Computational molecular biology, ACM Press
[18] Eskin E, Pevzner PA (2002). "Finding composite regulatory patterns in DNA sequences". Bioinformatics 18(1): S354–63
[19] F.N. Abu-Khzam, N.E. Baldwin, M.A. Langston, and N.F. Samatova(2005). “On the Relative Efficiency of Maximal Clique Enumeration Algorithms, with Applications to High-Throughput Computational Biology”. Proceedings, International Conference on Research Trends in Science and Technology, Beirut, Lebanon
[20] Favorov,AV., Gelfand,MS., Mironov,AA, and Makeev, VJ.(2002). “Yet another digging for DNA motifs Gibbs sampler”. Proceedings of BGRS 2002, Novisibisk, 1, 31-33
[21] Favorov AV, Gelfand MS, Gerasimova AV, Ravcheev DA, Mironov AA, and Makeev VJ(2005). “Gibbs sampler for identification of symmetrically structured, spaced dna motifs with improved estimation of the signal length”. Bioinformatics, 21(10):2240–2245
[22] Fogel GB, Porto VW, Weekes DG, Fogel DB,Griffey RH,McNeil JA,Lesnik E, Ecker DJ, Sampath R(2002). "Discovery of RNA structural elements using evolutionary computation". Nucleic Acids Res 30(23): 5310-5317
[23] Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB(2004). "Discovery of sequence motifs related to coexpression of genes using evolutionary computation". Nucleic Acids Res 32(13): 3826–35
[24] Frith MC, Hansen U, Spouge JL,Weng Z(2004). "Finding functional sequence elements by multiple local alignment". Nucleic Acids Res 32(1): 189–200
[25] Galil Z, Giancarlo R(1986). "Improved string matching with k mismatches". Sigact News 17(4): 52–54
[26] Ganesh R, Siegele DA, Ioerger TR(2003). "MOPAC: motif finding by preprocessing and agglomerative clustering from microarrays". Pac Symp Biocomput: 41–52
[27] Grossi R, Luccio F(1989). "Simple and efficient string matching with k mismatches". Information Processing Letters 33(3): 113–120
[28] Grundy WN, Bailey TL, Elkan CP, Baker ME(1997). “Meta-MEME: motif-based hidden Markov models of protein families”. Comput Appl Biosci. Aug;13(4):397-406
[29] GuhaThakurta D, Palomar L, Stormo GD, Tedesco P, Johnson TE, Walker DW, Lithgow G, Kim S, Link CD(2002). "Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods". Genome Res 12(5): 701–12
[30] Gusfield D(1997). "Rare Events and Conditional Events on Random Strings". Discrete Mathematics and Theoretical Computer Science 6: 191–214
[31] Häußler M, Nicolas J(2005). “Motif Discovery on Promotor Sequences”. Universi at Potsdam
[32] Henry CM Leung and Francis YL Chin(2005). “An Efficient Algorithm for the Extended (l,d)-Motif Problem with Unknown Number of Binding Sites”. BIBE Proceedings of the Fifth IEEE Symposium on Bioinformatics and Bioengineering, Pages: 11 – 18
[33] Hernandez D, Gras R, Appel R(2004). "MoDEL: an efficient strategy for ungapped local multiple alignment". Comput Biol Chem 28(2): 119–28
[34] Hertz GZ, Stormo GD(1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics 15(7-8): 563–77
[35] Hertz GZ, Stormo GD(1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences". Bioinformatics 15: 563 77
[36] Hu YJ.(2003). "Finding subtle motifs with variable gaps in unaligned dna Sequences". Comput Methods Programs Biomed 70(1): 11–20
[37] Hughes JD, Estep PW, Tavazoie S, Church GM(2000). "Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae". J Mol Biol 296(5): 1205–14
[38] Jacques VH, Alma. FR and Julio CV(2000). "Discovering regulatory elements in non-coding sequences by analysis of spaced dyads". NucleicAcids Res 28(8): 1808–18
[39] Jensen K., Styczynski M., Rigoutsos I, & Stephanopoulos G.(2005). "A generic motif discovery algorithm for sequential data". Bioinformatics, 22(1):21-28
[40] Keich U and Pevzner PA(2002). "Subtle motifs: defining the limits of motif finding algorithms". Bioinformatics 18(10): 1382–90
[41] Keich U and Pevzner PA(2002). "Finding motifs in the twilight zone". Bioinformatics 18(10): 1374-1381
[42] Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC (1989). "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment". Science 262 (5131): 208–14
[43] Liang S(2004). "cWINNOWER algorithm for finding fuzzy dna motifs". J Bioinform Comput Biol 2(1): 47–60
[44]Landau GM, Vishkin U(1986). “Efficient Parallel And Serial Approximate String Matching”. Theoretical Computer Science, Volume 43 , Issue 2-3 ,Pages: 239 -249
[45] Landau G.M. and Vishkin U(1986). "Efficient string matching with k mismatches". Theoretical Computer Science 43(2-3): 239–249
[46] Liu JS(1994). "The collapsed gibbs sampler in bayesian computations with applications to a gene regulation problem". Journal of the American Statistical Association 29(427): 958–967
[47] Liu X, Brutlag DL(2001). "BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes". Pac Symp Biocomput 7: 127–38
[48]MacIsaac KD, Fraenkel E(2006). “Practical Strategies for Discovering Regulatory DNA Sequence Motifs”. PLoS Comput Biol; 2(4): e36
[49] Mahony S, Hendrix D, Golden A, Smith TJ, Rokhsar DS(2005). "Transcription factor binding site identification using the Self-Organizing Map". Bioinformatics 21(9): 1807-14
[50] Manber U, Myers E(1990). "Suffix Arrays: A new method for on-line string searches". First ACM-SIAM Symposium on Discrete Algorithms: 319-327
[51] Marsan L, Sagot MF(2001). “Algorithms for extracting structured motifs using a suffix-tree with application to promoter and regulatory site consensus identification”. J. of Computational Biology, 7:345-360
[52] Marinescu VD; Kohane IS; Riva A(2005). "MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes". BMC Bioinformatics 6:79
[53] Michailidis PD. and Margaritis KG(2002). "On-Line Approximate String Searching Algorithms:Survey and Experimental Results". Intern. J. Computer Math 79(8): 867–888
[54] Navarro G(1997). “A partial deterministic automaton for approximate string matching”. In Proc. of the 4th South American Workshop on String Processing Carleton University Press, Pages 112-124
[55] Navarro G, Raffinot M(1998). “A bit-parallel approach to suffix automata: Fast extended string matching”. In Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin
[56] Navarro G(2001). "NR-grep: a fast and flexible pattern matching tool". Software Practice and Experience 31: 1265-1312
[57] Navarro G(2001). ”A Guided Tour to Approximate String Matching”. CM Computing Surveys (CSUR), Volume 33, Issue 1, pages: 31 - 88
[58] Pavesi G; Mauri G; Pesole G(2001). "An algorithm for finding signals of unknown length in DNA sequences". Bioinformatics 17(1): S207–14. 28, 33
[59] Pavesi G, Mereghetti P, Mauri G, Pesole G(2004). "Weeder Web: discovery of transcription factor binding sites”. Nucleic Acids Research 2(Web Server Issue): W199-W203
[60] Pevzner PA, Waterman MS( 1995). ”Multiple filtration and approximate pattern matching”. Author Affiliation: Univ of Southern California Source: Algorithmica (New York) 13 1-2 Jan-Feb Springer-Verlag New York Inc p 135-154
[61] Pevzner PA, Sze SH(2000). “Combinatorial approaches to finding subtle signals in DNA sequences”. Proc Int Conf Intell Syst Mol Biol, 8:269-78
[62] Pudimat R; Schukat-Talamazzini EG; Backofen R(2005). "A multiple-feature framework for modelling and predicting transcription factor binding sites". Bioinformatics 21(14): 3082–3088
[63] Ricardo AB, Gaston HG(1994). "Fast string matching with k mismatches". Information and Computation 108(2): 187–199
[64] Ricardo AB, Chris HP(1996). "Fast and practical approximate string matching". Information Processing Letters 59(1): 21–27
[65] Ricardo AB, Gaston HG(1992). "A new approach to text searching". Communications of the ACM 35(10): 74–82
[66] Rigoutsos I and Floratos A(1998). "Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm". Bioinformatics 14(1): 55–67
[67] Schbath S(1997). "An efficient statistic to detect over- and underrepresented words in DNA sequences". J Comput Biol 4(2): 189–92
[68] Sinha S(2003). "Discriminative motifs". J Comput Biol 10(3-4): 599–615
[69] Sinha S, Tompa M(2000). “An Statistical Method for Finding Transcription Factor Binding Sites”. In Proceedings of the Eighth International Conference on Interlligent Systems for Molecular Biology, 8:344~354
[70] Sinha, S. & Tompa, M(2000). “YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation”. Nucleic Acids Research, 2003, Vol. 31, No. 13, 3586-3588
[71] Sinha, S. and Tompa, M(2002). “Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation”. Nucleic Acids Research, vol. 30, no. 24, December, 5549-5560
[72] Shinozaki D, Akutsu T, Maruyama O(2003). “Finding optimal degenerate patterns in DNA sequences”. Bioinformatics, 19, Suppl 2:II206-II214
[73] Stefania B, Alessandro C, Andrea B, Cinzia P and Gian AD(2005). "A multistep bioinformatic approach detects putative regulatory elements in gene promoters". BMC Bioinformatics, 6: 121
[74] Schneider T, Stephens M.(1990). “Sequence Logos: A New Way to Display Consensus Sequences”. Nucleic Acids Res, 18:6097-6100
[75] Stormo GD and Hartzell GW(1989). " Identifying DNA and protein patterns with statistically significant alignments of multiple sequences ". Proc Natl Acad Sci U S A 86(4): 1183–7
[76] Stormo GD(2000), “DNA binding sites: representation and discovery”. Bioinformatics, Jan;16(1):16-23
[77] Stormo G. & Hartzell G(1989). ”Identifying protein-binding sites from unaligned
DNA fragments”. Proc Natl Acad Sci U S A, 86 (4), 1183–7
[78] Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos GN(2004). "An extension and novel solution to the (l,d)-motif challenge problem". Genome Informatics 15: 63-71
[79] Sutou T, Tamura K, Mori Y and Kitakami H(2003). “Design and Implementation of Parallel Modified PrefixSpan Method”. Lecture Notes in Computer Science, Volume 2858, Springer Berlin, Heidelberg
[80] Tarhio J, Ukkonen E(1993). "Approximate boyer-moore string matching". SIAM Journal on Computing 22(2): 243–260
[81] Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouzé Pand Moreau Y (2001). "A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling". Bioinformatics 17(12): 1113-1122
[82] Thompson W, Rouchka EC, Lawrence CE(2003). "Gibbs Recursive Sampler: finding transcription factor binding sites". Nucleic Acids Research 31(13): 3580-3585
[83] Tompa M., Li N, Bailey TL. , Church G M. , De Moor B, Eskin E, Favorov A. V, Frith M. C, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, N, Sinha S, Thijs G, Van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, and Zhu Z(2005). "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites, Nature Biotechnology". 23(1): 137 - 144
[84] Tomita E and Seki T(1989). “An Optimal Algorithm for finding all the cliques”. SIG Algorithms, 12, 91–98
[85] Van Helden J, Andre B, Collado-Vides J(1998). "Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies". J Mol Biol 281(5): 827–42
[86] VijayaSatya R. Mukheqee A(2004). “Pruner: algorithms for finding monad patterns in dna sequences”. In 3rd International IEEE Computer Society Computational Systems Bioinformatics Conference, Stanford, CA, USA IEEE Computer Society, pp. 662-665
[87] Wang RL, Tang Z, Cao QP(2003). “An efficient approximation algorithm for finding a maximum clique using Hopfield network learning”. Neural Comput,15(7):1605-19
[88] Wingender E , Dietze P , Karas H and Knüppel R(1996). "TRANSFAC: a database on transcription factors and their DNA binding sites". Nucleic Acids Res 24(1): 238-41
[89] Workman CT, Stormo GD(2000). ”ANN-Spec: a method for discovering transcription factor binding sites with improved specificity”. Pac Symp Biocomput; 467 -78
[90] Wu S, Manber U(1991). “AGREP - A Fast Approximate Pattern-matching Tool”. Proceedings of the Winter USENIX Conference San Francisco, USA, Berkeley, USA, 153-162
[91] Yada T, Totoki Y, Ishikawa M, Asai K and Nakai K(1998). "Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences". Bioinformatics 14(4): 317–325
[92] Zhou Q, Liu JS(2004). "Modeling within-motif dependence for transcription factor binding site predictions". Bioinformatics 20(6): 909–16
[93] Zhu J, Zhang MQ(1999). "SCPD: a promoter database of the yeast Saccharomyces cerevisiae". Bioinformatics 15(7-8): 607-11

簡易檢索 / 詳目顯示

相關論文