研究生: |
蘇柏丞 Pai-Chen Su |
---|---|
論文名稱: |
在GPU上進行巨量生物序列比對運算之策略研究 A Sequence Alignment Strategy for Large-Scale Biological Sequences Using GPU |
指導教授: |
陳省隆
Hsing-Lung Chen |
口試委員: |
莊博任
Po-Jen Chuang 吳乾彌 Chen-Mie Wu 呂政修 Jenq-Shiou Leu |
學位類別: |
碩士 Master |
系所名稱: |
電資學院 - 電子工程系 Department of Electronic and Computer Engineering |
論文出版年: | 2016 |
畢業學年度: | 104 |
語文別: | 中文 |
論文頁數: | 60 |
中文關鍵詞: | 巨量序列 、序列 、比對 、多處理 、緒 、GPU 、平行演算法 |
外文關鍵詞: | huge sequences, sequence alignment, multi-threading |
相關次數: | 點閱:229 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
序列比對是生物資訊研究中最基本且相當重要的研究工具,它可以比較及分析出兩條或多條序列之間的相似程度,藉此推測是否源自共同的祖先。
在生物序列比對中,經常需要與多個不同序列進行比對,才能找出相似之結果,通常輸入的兩個字串的長度都很長,因此所需的處理時間相當久,有很多演算法被研究出來以降低處理時間。例如Smith - Waterman algorithm是一種local alignment的序列比對方法,但是這個演算法的時間複雜度和空間複雜度都需要二次方(N2)的時間和空間,特別是基因比對資料所佔的容量特別大時,這個方法就行不通了,行不通的原因是因為它佔了很大的記憶體與硬碟空間,無法一次將所有資訊直接存起來,且需要龐大的運算。
所以在運算部分,我們以Smith-Waterman 與 Needleman-Wunsch演算法做基礎,並使用GPU做加速,與CPU同時進行平行處理,運行在GPU上的程式稱為核心函式(kernel)[1],利用CUDA多個串流(stream)來一次呼叫(launch)多個核心函式,將整個矩陣做多次切割,再個別運算,使運算時間盡可能縮短。
In bioinformatics, a sequence alignment is the most basic and quite important research tool. It is a way of identifying regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. We can speculate the sequences whether or not derived from a common ancestor. In sequence alignment, we often need many times to compare the results. Typically the length of the two input strings are very long, so the long processing time is required. There are many algorithms to reduce the processing time like Smith-Waterman. Smith-Waterman algorithm is a local alignment comparison method, but its time and space complexity is O(n²). This algorithm would not work when we alignment of two DNA sequences. Because this algorithm need a large amount of memory and disk space. It is unable to save all information and needs huge processing time.
Based on Smith-Waterman and Needleman-Wunsch algorithm, we use GPU acceleration and parallel processing. The use of CUDA streams to launch kernels[1]. We repeatedly cut the entire matrix then individual operation. We let the processing time as short as possible.
[1] 張舒, 褚豔利, 趙開勇, 張鈺勃, and 陳乃塘, GPU 高效能運算之CUDA vol. 碁峰, 2011.
[2] A. Apostolico and Z. G. (eds.), Pattern Matching Algorithms vol. Oxford University Press, 1997.
[3] M. Crochemore and W. Rytter, Text algorithms vol. Oxford University Press, 1994.
[4] G. A. Stephen, String Searching Algorithm vol. World Scientific, 1994.
[5] D. Gusfield, Algorithms on Strings, Tree and Sequences vol. Cambridge University Press, Cambridge, 1997.
[6] W. R. Pearson and D. J. Lipman, "Improved Tools for Biological Sequence Comparison," Proc. Natl. Acad. Sci. USA, Vol. 85, pp. 2444-2448, April 1988.
[7] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic Local Alignment Search Tool," J. Mol. Biol., vol. 215, pp. 403-410, 1990.
[8] M. Vlachos, D. Gunopulos, and G. Kollios, "Robust Similarity Measures for Mobile Object Trajectories," Proc. DEXA., pp. 721-728, 2002.
[9] H. Fashandi and A. M. E. Moghaddam, "A new rotation invariant similarity measure for trajectories," in 2005 International Symposium on Computational Intelligence in Robotics and Automation, 2005, pp. 631-634.
[10] T. F. Smith and M. S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, no. 1, pp. 195-197, Mar. 1981.
[11] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, no. 3, pp. 705-708, Dec. 1982.
[12] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, "GPU Accelerated Smith-Waterman," Proc. Sixth Int'l Conf. Computational Science (ICCS), vol. 3994, pp. 188-195, 2006.
[13] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-Wittig, "Bio-Sequence Database Scanning on a GPU," Proc. 20th Int'l Conf.Parallel and Distributed Processing (IPDPS), 2006.
[14] S. Manavski and G. Valle, "CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment," BMC Bioinformatics, 9(Suppl 2), 2008.
[15] E. F. d. O. Sandes and A. C. M. A. d. Melo, "Retrieving Smith-Waterman Alignments with Optimizations for Megabase Biological Sequences Using GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 24, pp. 1009-1021, 2013.
[16] R. Inam, "An Introduction to GPGPU Programming-CUDA Architecture," Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se.
[17] J. Ghorpade, J. Parande, M. Kulkarni, and A. Bawaskar, "GPGPU PROCESSING IN CUDA ARCHITECTURE," Advanced Computing: An International Journal ( ACIJ), vol. 3, No.1, January 2012.
[18] N. Corporation, "CUDA C PROGRAMMING GUIDE," PG-02829-001_v7.5, September 2015.
[19] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors A Hands-on Approach Second Edition, 2013.
[20] W. J. Masek and M. S. Paterson, "A faster algorithm for computing string edit distance," Journal of Computer and System Sciences, pp. 18-31, 1980.
[21] S. Wu and U. Manber, "Fast Text Searching: Allowing Errors," Communications of the ACM, vol. 35, no. 10, pp. 83-91, October 1992.
[22] A. Amir, D. Keselman, G. M. Landau, M. Lewenstein, N. Lewenstein, and M. Rodeh, "Text Indexing and Dictionary Matching with One Error," Journal of Algorithms, vol. 37, no. 2, pp. 309-325, November 2000.
[23] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, 1966.
[24] K. M. Caho, W. R. Pearson, and W. Miller, "Aligning Two Sequences within a Specified Diagonal Band," Computer applications in the biosciences, vol. 8, no. 5, pp. 481-487, 1992.
[25] K. M. Chao, "Dynamic-programming strategies for analyzing biomolecular sequences," Department of Computer Science and Information Engineering, National Taiwan University, February 2003.
[26] W. R. Pearson and W. Miller, "Dynamic Programming Algorithms for Biological Sequence Comparison," Methods in Enzymology, vol. 210, pp. 575-601, 1992.
[27] D. Sankoff and J. B. Kruskal(eds.), "Time Warps, String Edits, and Macromolecules," The Theory and Practice of Sequence Comparison, 1983.
[28] A. S. Deshpande, D. S. Richards, and W. R. Pearson, "A Platform for Biological Sequence Comparison on Parallel Computers," Computer Applications in the Biosciences, vol. 7, no. 2, pp. 237-247, 1991.
[29] J. Setubal and J. Meidanis, "Sequence comparison and dataBase search," In Introduction To Computational Molecular Biology, pp. 47-103, 1997.
[30] T. F. Smith and M. S. Waterman, "Comparison of biosequences," Advances in Applied Mathematics, vol. 2, pp. 482-489, 1981.
[31] X. Huang and W. Miller, "A time-efficient, linear-space local similarity algorithm," Advances in Applied Mathematics, vol. 12, no. 3, pp. 337-357, 1991.
[32] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, et al., "Gapped BLAST and PSI-BLAST: A new generation of protein dataBase search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[33] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, March 1970.
[34] J. D. Thompson, D. G. Higgins, and T. J. Gibson, "Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, vol. 22, no. 22, pp. 4673-4680, 1994.
[35] T. Oliver, B. Schmidt, and D. Maskell, "Hyper Customized Processors for Bio-Sequence DataBase Scanning on FPGAs," ACM/SIGDA 13th International Symposium on Field Programmable Gate Arrays, pp. 229-237, 2005.
[36] B. West, R. D. Chamberlain, R. S. Indeck, and Q. Zhang, "An FPGA-Based Search Engine for Unstructured DataBase," In Proc. of 2nd Workshop on Application Specific Processors, pp. 25-32, December 2003.
[37] T. V. Court and M. C. Herbordt, "Families of FPGA-Based Accelerators for Approximate String Matching," MicroProcessors and Microsystems, vol. 31, no. 2, pp. 135-145, March 2007.
[38] S. Dydel and P. Bala, "Large scale protein sequence alignment using FPGA reprogrammable logic devices," LNCS Proceedings of FPL., pp. 23–32, 2004.
[39] M. C. Herbordt, J. Model, Y. Gu, B. Sukhwani, and T. VanCourt, "Single Pass, BLAST-Like, Approximate String Matching on FPGAs," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 217-226, April 2006.
[40] E. W. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.