簡易檢索 / 詳目顯示

研究生: 黃凱麟
Kai-Lin Huang
論文名稱: 應用 Bloom Filter 於基因序列比對之研究
Exploiting Bloom Filter for Solving the Gene Sequence Matching Problem
指導教授: 陳維美
Wei-Mei Chen
口試委員: 陳維美
Wei-Mei Chen
林昌鴻
Chang-Hong Lin
林淵翔
Yuan-Hsiang Lin
吳晉賢
Chin-Hsien Wu
學位類別: 碩士
Master
系所名稱: 電資學院 - 電子工程系
Department of Electronic and Computer Engineering
論文出版年: 2023
畢業學年度: 111
語文別: 中文
論文頁數: 49
中文關鍵詞: 基因定序平行演算法
外文關鍵詞: Bloom Filter, DNA Sequence
相關次數: 點閱:150下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 提升在 DNA 測序 (DNA sequencing) 問題中判斷一個子字串 k-mer 是否存在於一個 DNA 資料集當中的效率,可以大幅度提升生物資訊學 (bioinformatics) 相關演算法的效能。本論文提出 GBFd 演算法將資料壓縮存儲至多維陣列所建立出的一種變形 Bloom Filter,與 Standard Bloom Filter(SBF) 相比,降低運算時間以及有較少的偽陽性率 (false positive rate)。GBFd 實驗結果顯示,可以有效提升演算法執行效率,加速比對之運算時間。


    Improving the efficiency of determining whether a k-mer exists in a DNA sequencing
    problem can significantly improve the efficiency of bioinformatics related algorithms. Inthis paper, GBFd parallel algorithm is proposed to compress and store data into multidimensional array to build a variant Bloom Filter. Compared with the Standard Bloom Filter(SBF), it reduces the execution time and has lower false positive rate. The experimental results of GBFd show that it can effectively improve the execution efficiency and speed up the execution time on GPU.

    教授推薦書 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I 論文口試委員審定書 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II 論文摘要 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV 誌謝 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V 目錄 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI 圖目錄 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII 表目錄 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X 1 緒論 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 文獻探討 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.1 使用 Bloom Filter 解決 DNA 定序相關問題 . . . . . . . . . . . . . . . 2 2.2 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2.1 Bloom Filter 基本定義 . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.2 Bloom Filter 偽陽性 . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.3 Bloom Filter 相關研究 . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.4 r-Dimensional Bloom Filter(rDBF) . . . . . . . . . . . . . . . . 6 2.3 GPGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1 Host 與 Device . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Grid 與 Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.3 GPU 記憶體類型 . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.4 Bloom Filter 應用於 GPU 相關研究 . . . . . . . . . . . . . . . 1 3 研究方法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1 問題描述 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 演算法介紹 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.1 Read dataset and Initialize parameter . . . . . . . . . . . . . . . 16 3.2.2 Allocating BFd . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.3 Insert k-mer to BFd . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.4 Lookup k-mer in BFd . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 GBFd 平行演算法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.1 Read dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.2 Build BFd in GPU . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.3 Parallel Insert k-mer to BFd . . . . . . . . . . . . . . . . . . . . 23 3.3.4 Parallel Lookup k-mer in BFd . . . . . . . . . . . . . . . . . . . 25 4 實驗分析 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.1 實驗環境 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.1.1 合成資料集 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.1.2 Bloom Filter 空間 . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 實驗結果 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.1 Bloom Filter 偽陽性率 . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.2 GBFd 平行演算法實驗結果 . . . . . . . . . . . . . . . . . . . . 31 4.2.3 實驗結果分析 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5 結論 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 參考文獻 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    [1] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
    [2] R. Patgiri, S. Nayak, and S. K. Borgohain, “rdbf: A r-dimensional bloom filter for
    massive scale membership query,” Journal of Network and Computer Applications,
    vol. 136, pp. 100–113, 2019.
    [3] R. Rozov, R. Shamir, and E. Halperin, “Fast lossless compression via cascading
    bloom filters,” in BMC bioinformatics, vol. 15, pp. 1–8, BioMed Central, 2014.
    [4] K. Malde and B. O’Sullivan, “Using bloom filters for large scale gene sequence
    analysis in haskell,” in Practical Aspects of Declarative Languages (A. Gill and
    T. Swift, eds.), (Berlin, Heidelberg), pp. 183–194, Springer Berlin Heidelberg, 2009.
    [5] H. Stranneheim, M. Käller, T. Allander, B. Andersson, L. Arvestad, and J. Lundeberg, “Classification of dna sequences using bloom filters,” Bioinformatics, vol. 26,no. 13, pp. 1595–1600, 2010.
    [6] D. Pellow, D. Filippova, and C. Kingsford, “Improving bloom filter performance
    on sequence data using k-mer bloom filters,” Journal of Computational Biology,
    vol. 24, no. 6, pp. 547–557, 2017.
    [7] V. Nygaard, A. Løland, M. Holden, M. Langaas, H. Rue, F. Liu, O. Myklebost,
    Ø. Fodstad, E. Hovig, and B. Smith-Sørensen, “Effects of mrna amplification on
    gene expression ratios in cdna experiments estimated by analysis of variance,” Bmc
    Genomics, vol. 4, no. 1, pp. 1–13, 2003.
    [8] J. R. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation
    sequencing data,” Genomics, vol. 95, no. 6, pp. 315–327, 2010.
    [9] P. Pandey, M. A. Bender, R. Johnson, and R. Patro, “Squeakr: an exact and approximate k-mer counting system,” Bioinformatics, vol. 34, pp. 568–575, 10 2017.
    [10] A.-A. Mamun, S. Pal, and S. Rajasekaran, “KCMBT: a k-mer Counter based on
    Multiple Burst Trees,” Bioinformatics, vol. 32, pp. 2783–2790, 06 2016.
    [11] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Trans. Comput. Syst., vol. 26, jun 2008.
    [12] A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,”
    SIGOPS Oper. Syst. Rev., vol. 44, p. 35–40, apr 2010.
    [13] J. H. Mun and H. Lim, “New approach for efficient ip address lookup using a bloom
    filter in trie-based algorithms,” IEEE Transactions on Computers, vol. 65, no. 5,
    pp. 1558–1565, 2016.
    [14] M. Antikainen, T. Aura, and M. Särelä, “Denial-of-service attacks in bloomfilter-based forwarding,” IEEE/ACM Transactions on Networking, vol. 22, no. 5,
    pp. 1463–1476, 2014.
    [15] S. Geravand and M. Ahmadi, “Bloom filter applications in network security: A stateof-the-art survey,” Computer Networks, vol. 57, no. 18, pp. 4047–4064, 2013.
    [16] A. Kirsch and M. Mitzenmacher, “Less hashing, same performance: Building a better bloom filter,” Random Structures & Algorithms, vol. 33, no. 2, pp. 187–218,
    2008.
    [17] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: a scalable wide-area
    web cache sharing protocol,” IEEE/ACM transactions on networking, vol. 8, no. 3,
    pp. 281–293, 2000.
    [18] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher, “Cuckoo filter: Practically better than bloom,” in Proceedings of the 10th ACM International
    on Conference on emerging Networking Experiments and Technologies, pp. 75–88,
    2014.
    [19] P. S. Almeida, C. Baquero, N. Preguiça, and D. Hutchison, “Scalable bloom filters,” Information Processing Letters, vol. 101, no. 6, pp. 255–261, 2007.
    [20] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, “The case for learned index structures,” in Proceedings of the 2018 international conference on management
    of data, pp. 489–504, 2018.
    [21] F. Xu, P. Liu, J. Xu, J. Yang, and S.-M. Yiu, “Multi-dimensional bloom filter: design and evaluation,” IEICE TRANSACTIONS on Information and Systems, vol. 100,
    no. 10, pp. 2368–2372, 2017.
    [22] A. Appleby, “Murmurhash.” https://sites.google.com/site/murmurhash/,
    2022.
    [23] M. J. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.
    [24] NVIDIA, “CUDA Toolkit Documentation.” https://docs.nvidia.com/cuda/
    index.html, 2022.
    [25] F. Lin, G. Wang, J. Zhou, S. Zhang, and X. Yao, “High-performance ipv6 address
    lookup in gpu-accelerated software routers,” Journal of Network and Computer Applications, vol. 74, pp. 1–10, 2016.
    [26] S. Xiong, Y. Yao, M. Berry, H. Qi, and Q. Cao, “Frequent traffic flow identification through probabilistic bloom filter and its gpu-based acceleration,” Journal of Network and Computer Applications, vol. 87, pp. 60–72, 2017.
    [27] C.-L. Hung, C.-Y. Lin, and P.-C. Wu, “An efficient gpu-based multiple pattern
    matching algorithm for packet filtering,” J. Signal Process. Syst., vol. 86, p. 347–
    358, mar 2017.
    [28] R. Bhat, R. K. Thilak, and R. P. Vaibhav, “Hunting the pertinency of hash and bloom filter combinations on gpu for fast pattern matching,” International Journal of Information Technology, pp. 1–13, 2022.

    無法下載圖示 全文公開日期 2028/01/16 (校內網路)
    全文公開日期 本全文未授權公開 (校外網路)
    全文公開日期 本全文未授權公開 (國家圖書館:臺灣博碩士論文系統)
    QR CODE