簡易檢索 / 詳目顯示

研究生: 劉嘉傑
Jia-Jie Liu
論文名稱: 在壓縮字串中解序列比對問題:最大共同子序列, 編輯距離及重複序列
Solving Some Sequence Problems on Run-Length Encoded Strings : Longest Common Subsequences, Edit Distances, and Squares
指導教授: 王有禮
Yue-Li Wang
口試委員: 張瑞雄
Ruay-Shiung Chang
陳健輝
Gen-Huey Chen
趙坤茂
Kun-Mao Chao
張肇明
Jou-Ming Chang
黃光璿
Guan-Shieng HUANG
徐俊傑
Chiun-Chieh Hsu
學位類別: 博士
Doctor
系所名稱: 管理學院 - 資訊管理系
Department of Information Management
論文出版年: 2007
畢業學年度: 96
語文別: 英文
論文頁數: 85
中文關鍵詞: 最大共同子序列編輯距離重複序列
外文關鍵詞: Longest Common Subsequences, Edit Distances, Squares
相關次數: 點閱:223下載:5
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 在最近十年期間,分子生物目擊也參與一資訊革命,即快速DNA定序(DNA Sequencing)技術的發展。在序列分析上,成對序列比較是一個根本的工作,其提供資料庫搜尋演算法的基礎,該演算法嘗試去決定序列是否具重大的"相似"及"相異"的程度,來判斷它們可能是或不是具同源性。在字串序列比對的問題中,尤其以計算兩字串序列間的最大共同子序列與最小編輯距離為最常見且最為普遍的問題。

    當我們給定了任意兩筆字串X和Y,假設X和Y的字串長度分別為m和n,另外,假設X和Y在經過Run Length Encoding壓縮技術壓縮過後,它們的長度分別為M和N。若我們想要計算X與Y之間的最大共同子序列,在過去己知最好的演算法裡,我們需要O(mN+Mn−mn)的時間,才能求得。在本篇論文中,我們提供了一個時間複雜度為O(min{mN,Mn})的演算法,它能夠更快速地解決這個問題。此外,在最小編輯距離的問題部分,我們另外提供了一個時間複雜度為O(min{mN,Mn})的演算法,它同樣地比目前己知最好的演算法O(mN+Mn)更加優異。

    任何一個字串w,若在其後緊跟另一個相同的w,即出現ww,則我們稱ww為重複序列。在一個任意字串中,去尋找所有的重複序列也是一個非常重要的議題。我們提出了一個全新的演算法,可以直接針對經過Run Length Encoding壓縮技術壓縮過後的字串直接去求得所有的重複序列且只需要花費O(NlogN)的時間。


    Measuring the similarity or difference between two strings is fundamental to
    many applications. In bioinformatics, one has to predict the structures of RNA and proteins, to classify the functions of molecules, to infer the phylogeny of organisms, and to search entries in huge sequence databases. While processing electronic documents, one needs fast and flexible indexing techniques to perform searches. Amongst all of these, similarity search plays an important role. For this purpose, many measures are defined. The longest common subsequence and edit distance are the most popular problems in string processing.

    In this dissertation, we propose an O(min{mN,Mn}) time algorithm for finding
    a longest common subsequence of strings X and Y with lengths m and n, respectively, and run-length-encoded lengths M and N, respectively. We propose a new recursive formula for finding a longest common subsequence of Y and X which is in the run-length-encoded format. That is, Y = y_1y_2 . . . y_n and X = r_1^{l_1} r_2^{l_2} . . . r_M^{l_M}, where r_i is the repeated character of run i and l_i is the number of its repetitions. We improve the previous best known time bound O(mN +Mn−mn). On the other hand, we also improve the time bound to O(min{mN,Mn}) for finding the edit distance between strings X and Y .

    Squares (or tandem repeats) are strings of the form ww where w is an arbitrary
    non-empty string. It plays a central role from word combinatorics and application perspective. Main and Lorentz proposed an O(n log n) time algorithm to find all squares in a string. Their algorithm is optimal when input data are restricted to letters. Based on their result, we show how to locate all squares in a run-length encoded string in time O(NlogN) where N is its runs number. The time complexity of our result is optimal, and it is irrelevant to the length of the original uncompressed string.

    1 Introduction 1.1 Preliminaries 1.2 Longest Common Subsequences 1.3 Edit Distances 1.4 Squares 1.5 Organization of the Dissertation 2 Finding a Longest Common Subsequence between a Run-Length Encoded String and an Uncompressed String 2.1 Some Properties of an LCS Table 2.2 An Efficient Algorithm for Finding the Nearest Originating Value 2.3 The Main Algorithm 2.4 Remarks 3 Edit Distance for a Run-Length Encoded String and an Uncompressed String 3.1 A Sketch of the Algorithm 3.2 An Efficient Algorithm for Finding ED(Xi, y(u,j]) 3.3 Main Idea 3.4 The Main Algorithm 3.5 Remarks 4 Locating All Squares in a Run-Length Encoded String 4.1 Main Ideas 4.2 Suffix Trees 4.3 Find all Squares by Using Suffix Tree 4.4 Find All Squares in a Run-Length Encoded String 5 Square Type with a Run-Length Encoded String 5.1 Square Type 5.2 Some Properties of Square Type with an RLE String 6 Conclusion and Future Works 7 References

    [1] A. Aho, D. Hirschberg, and J. Ullman, Bounds on the complexity of the longest common subsequence problem, Journal of the ACM, 23 (1)(1976) pp. 1–12.
    [2] A. V. Aho, Algorithms for finding patterns in strings, in: J. Van Leeuwen (Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam, 1990.
    [3] Alok Aggarwal and James K. Park, Notes on searching in multidimensional monotone arrays, Proc. 29th Ann. IEEE Symp. Foundations of Comput. Science, (1988) pp. 497–512.
    [4] Alberto Apostolico, F. P. Preparata, Optimal off-line detection of repetitions in a string, Theoretical Computer Science, 22(1983) pp. 297–315.
    [5] Alberto Apostolico, M. Atallah, L. Larmore, and S. Mcfaddin, Efficient parallel algorithms for string editing and related problems, SIAM Journal on Computing, 19 (1990) pp. 968–988.
    [6] Alberto Apostolico, Gad M. Landau, and Steven Skiena, Matching for Run-Length Encoded Strings, Journal of Complexity, 15 (1)(1999) pp. 4–16.
    [7] Ora Arbell, Gad M. Landau, and Joseph S. B. Mitchell, Edit Distance of Run-Length Encoded Strings, Information Processing Letters 83(2002) pp. 307–314.
    [8] Brenda S. Baker and Raffaele Giancarlo, Sparse Dynamic Programming for Longest Common Subsequence from Fragments, Journal of Algorithms, 42 (2002) pp. 231–254.
    [9] Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar, Approximating Edit Distance Efficiently, in: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, 2004.
    [10] A. van Belkum, S. Scherer, W. van Leeuwen, D. Willemse, L. van Alphen, H. Verbrugh, Variable number of tandem repeats in clinical strains of Haemophilus influenzae, Infection and Immunity, 65 (12) (1997) 5017–5027.
    [11] H. Bunke and J. Csirik, An Improved Algorithm for Computing the Edit Distance of Run Length Coded Strings, Information Processing Letters 54(1995) pp. 93–96.
    [12] Richard P. Brent, The parallel evaluation of general arithmetic expressions, Journal of the Association for Computing Machinery 21 (1974) pp. 201–206.
    [13] Vaclar Chavatal, David A. Klarner, and Donald E. Knuth, Selected combinatorial research problem, Technical Report STAN-CS-72-292 Stanford University, (1972) pp. 26.
    [14] J. G. Chen and R. C. T. Lee, Finding all tandem arrays in DNA sequences, The 23rd workshop on combinatorial mathematics and computation theory, pp. 1–8.
    [15] C. Choffrut and J. Karhumaki, Combinatorics of words, in: G. Rozenberg, A. Salomaa (Eds.), Handbook of Formal Languages, Vol. 1, Springer, Berlin, Heidelberg, New York, 1997, pp. 329–438.
    [16] Richard Cole and Ramesh Hariharan, Approximate string matching: a simpler faster algorithm, in: Proceedings 9th Annu. ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, (1998) pp.463–472.
    [17] Maxime Crochemore, An optimal algorithm for computing the repetitions in a word, Information Processing Letters, 12(5)(1981) pp. 244–250.
    [18] Maxime Crochemore, Recherche lineair d’un carre dan un mot, Comptes Rendus Acad. Sci. Paris Ser, I Math, 296 (1983) pp. 781–784.
    [19] Maxime Crochemore, Transducers and Repetitions, Theoretical Computer Science, 45 (1986) pp. 63–86.
    [20] Maxime Crochemore and Wojciech Rytter, Text Algorithms, Oxford University Press, New York, 1994.
    [21] Maxime Crochemore and Wojciech Rytter, Squares, cubes, and time-space efficient string searching, Algorithmica 13 (1995) pp. 405–425.
    [22] Maxime Crochemore and Wojciech Rytter, Jewels of Stringology, World Scientific, Singapore, 2002.
    [23] Maxime Crochemore, GadM. Landau, andMichal Ziv-Ukelson, A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices, SIAM Journal of Computing 32(6)(2003) pp. 1654–1673.
    [24] E. Delacourt, J. F. Myoupo, D. Smem, A constant time parallel detection of repetitions, Parallel Processing Letters, 9(1)(1999) pp. 81–92.
    [25] David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano, Sparse Dynamic Programming I: Linear Cost Functions, Journal of the ACM, 39 (3) (1992) pp. 519–545.
    [26] P. Le Fleche, Y. Hauck, L. Onteniente, F. Prieur, A. Denoeud, V. Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, A tandem repeats database for bacterial genomes: application to the genotyping of Yersinia pestis and Bacillus anthracis, BMC Microbiol., 1 (1) (2001) pp. 2.
    [27] Avieri S. Fraenkel and Jamie Simpson, Note How Many Squares Can a String Contain ?, Journal of Combinatorial Theory, Series A 82 (1998) pp. 112–120.
    [28] Valerio Freschi and Alessandro Bogliolo, Longest Common Subsequence between Run-Length-Encoded Strings: a New Algorithm with Improved Parallelism, Information Processing Letters, 90 (4) (2004) pp. 167–173.
    [29] Zvi Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, Journal of Complexity 4(1988) pp. 33–72.
    [30] Zvi Galil and J. Seiferas, Time-space optimal string matching, Journal Comput. System Sciences 26 (3) (1983) pp. 280–294.
    [31] Zvi Galil and Kunsoo Park, An Improved Algorithm for Approximate String Matching, SIAM Journal on Computing 19(6)(1990) pp. 989–999.
    [32] Thierry Garcia, David Seme, A Coarse-Grained Multicomputer algorithm for the detection of repetitions, Information Processing Letters, 93 (2005) pp. 307– 313.
    [33] Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
    [34] Dan Gusfield and Jens Stoye, Linear time algorithm for finding and representing all the tandem repeats in a string, Technical Report CSE-98-4, Computer Science Department, University of California, Davis, 1998.
    [35] Dan Gusfield and Jens Stoye, Linear time algorithm for finding and representing all the tandem repeats in a string, Journal of Computer and System Sciences, 69 (2004) pp. 525–546.
    [36] Daneiel S. Hirschberg, A linear space algorithm for computing maximal common subsequences, Communications of the ACM, 18 (6) (1975) pp. 341– 343.
    [37] Daneiel S. Hirschberg, Algorithms for the Longest Common Subsequence Problem, Journal of the ACM, 24 (4) (1977) pp. 664–675.
    [38] James W. Hunt and Thomas G. Szymanski, A Fast Algorithm for Computing Longest Common Subsequences, Communications of the ACM, 20 (5) (1977) pp. 350–353.
    [39] Lucian Ilie, Note A simple proof that a word of length n has at most 2n distinct squares, Journal of Combinatorial Theory, Series A 112 (2005) pp. 163–164.
    [40] Richard Johnsonbaugh and Marcus Schaefer, Algorithms, Pearson Education Upper Saddle River, New Jersey 07458.
    [41] D. E. Knuth, J. H. Morris and V. R. Pratt, Fast pattern-matching in string, SIAM Journal on Computing, 6, (1977), pp. 323–350.
    [42] R. Kolpakov and G. Kucherov, Finding maximal repetitions in a word in linear time, in the 40th Annual Symposium on Foundations of Computer Science, 1999, pp. 596–604.
    [43] R. Kolpakov and G. Kucherov, Finding repeats with fixed gap, in Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, 2000, pp. 162–168.
    [44] V. I. Levenshtein, Binary Codes Capable of Correcting, Deletions, Insertions and Reversals, Soviet Physics Doklady 10 (1965) pp. 707–710.
    [45] R. C. T. Lee, R. C. Chang, S. S. Tseng, and Y. T. Tsai, Introduction to the Design and Analysis of Algorithms Second Edition.
    [46] J. Van Leeuwen (Ed.), Handbook of Theoretical Computer Science, Elsevier, Amsterdam, 1993.
    [47] H. F. Leung, Z. S. Peng, and H. F. Ting, An efficient algorithm for online square detection, Theoretical Computer Science, 363 (2006) pp. 69–75.
    [48] Mi Lu, A parallel algorithm for longest-common-subsequence computing, Proc. Int. Conference Computing and Information (1990) pp. 372–377.
    [49] Mi Lu and Hua Lin, Parallel algorithms for Longest Common Subsequence Problem, IEEE Transaction on Parallel and Distributed Systems 5 (8)(1994) pp. 835–848.
    [50] M. Lothaire, Combinatorics on Words, in: Encyclopedia of Mathematics and its Applications vol. 17, Addison-Wesley, Reading, MA, 1983.
    [51] Michael G. Main and Richard J. Lorentz, An O(n log n) Algorithm for Finding All Repetitions in a String, Technical Report CS-79-056, Washington State University, Pullman, 1979.
    [52] Michael G. Main and Richard J. Lorentz, An O(n log n) Algorithm for Finding All Repetitions in a String, Journal of Algorithms, Vol. 5, 1984, pp. 422–432.
    [53] Michael G. Main and Richard J. Lorentz, Linear time recognition of squarefree strings, in A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words, volume F12 of NATO ASI Series, Springer, Berlin, 1985, pp. 271–278.
    [54] Michael G. Main, Detecting leftmost maximal periodicities, Discrete Applied Mathematics, 25 (1989) pp. 145–153.
    [55] T. Mathies, A fast parallel algorithm to determine edit distance, Technical Report CMU-CS-88-130, Department of Computer Science, Carnegie Mellon University Pittsburgh, PA, (1988).
    [56] E. M. McCreight, A space-economical suffix tree construction algorithm, Journal of the ACM 23 (1976) pp. 262–272.
    [57] J. Modelevsky, Computer applications in applied genetic Engineering, Advances in applied Microbiology 30 (1984) pp. 169–195.
    [58] G. Myers, A fast bit-vector algorithm for approximate string matching base on dynamic programming, Journal of the ACM 46 (3) (1999) pp. 395–415.
    [59] S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48(3)(1970) pp. 443–453.
    [60] P. Pevzner, Computational Molecular Biology, Elsevier Science Ltd. 2003.
    [61] K. Sayoood, E. Fow (Eds.), Introduction to Data Compression, second edition, Morgan Kaufmann, Los Altos, CA, 2000.
    [62] B. Schieber, U. Vishkin, On finding lowest common ancestors: simplification and parallelization, SIAM Journal on Computing 17 (6) (1988) pp. 1253–1262.
    [63] P. Sellers, The theory and computation of evolutionary distances: pattern recognition, Journal of Algorithms 1 (1980) pp. 359–373.
    [64] A. O. Slisenko, Detection of periodicities and string-matching in real time, Journal of Mathematical Sciences, 22 (3) (1983) pp. 1316–1387.
    [65] J. Stoye and D. Gusfield, Simple and flexible detection of contiguous repeats using a suffix tree, Theoretical Computer Science, 270 (2002) pp. 843–856.
    [66] J. Tarhio and E. Ukkonen, Approximate Boyer-Moore String Matching, SIAM Journal on Computing 22(2)(1993) pp. 243–260.
    [67] A. Thue, Uber unendlich Zeichenreihen, Norske Vid. Selsk. Skr. I, Mat. Nat. Kl. Christiana 7, (1906), 1-22. Reprinted in T. Nagell, A. Selberg, S. Selberg, and K. Thalberg (Eds.), Selected Mathematical Papers of Axel Thue. Oslo, Norway: Universitetsforlaget, (1997) pp. 139–158.
    [68] E. Ukkonen, Algorithms for Approximate String Matching, Information and Control 64(1985) pp. 100–118.
    [69] E. Ukkonen, On-line construction of suffix-tree, Algorithmica 14(1995) pp. 249– 260.
    [70] R. A. Wagner and M. J. Fischer, The string-to-string correction problem, Journal of the ACM 21(1)(1974) pp. 168–173.
    [71] P. Weiner, Linear pattern matching algorithms, Proceedings of the 14th IEEE Symp. on Switching and Automata Theory (1973) pp. 1–11.
    [72] A. Wright, Approximate string matching using within-word parallelism, Software Practice and Experience 24 (4) (1994) pp. 337–362.
    [73] S. Wu and U. Manber, Fast text searching allowing errors, Communications of the ACM 35 (10) (1992) pp. 83–91.
    [74] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transaction on Information Theory 2 24(1978) pp. 530–536.

    QR CODE