Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO)

Gaurav Gupta; Minghao Yan; Benjamin Coleman; Bryce Kille; R. A. Leo Elworth; Tharun Medini; Todd Treangen; Anshumali Shrivastava

arXiv:1910.04358·q-bio.GN·May 3, 2022

TL;DR

The paper introduces RAMBO, a novel data structure that enables fast, parallel, and memory-efficient search in massive genomic datasets, significantly outperforming existing methods in speed and scalability.

Contribution

RAMBO is a new set membership data structure for genomics that offers faster query times, supports parallel updates, and handles large-scale data efficiently.

Findings

01

RAMBO achieves 9-hour indexing of 170TB genomic data on 100 nodes.

02

It outperforms state-of-the-art methods in query speed.

03

It maintains low false-positive and zero false-negative rates.
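The zero false-negative property in finding 03 follows from how a repeated-and-merged Bloom filter answers queries. The sketch below is a minimal illustration, not the authors' implementation: the names `BloomFilter`, `RamboSketch`, `kmers`, and all parameter values are assumptions chosen for readability. Each repetition hashes every dataset into one of B buckets, and each bucket holds a single Bloom filter over the merged k-mers of its datasets. A query collects the datasets in every matching bucket and intersects those candidate sets across the R repetitions.

```python
import hashlib


def _hash(data: str, seed: int, mod: int) -> int:
    """Deterministic seeded hash of a string into [0, mod)."""
    digest = hashlib.blake2b(f"{seed}:{data}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % mod


def kmers(seq: str, k: int) -> list:
    """All length-k substrings of seq (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Plain Bloom filter; one byte per bit to keep the sketch simple."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def add(self, item: str) -> None:
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.num_bits)] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[_hash(item, seed, self.num_bits)]
                   for seed in range(self.num_hashes))


class RamboSketch:
    """Illustrative R x B grid of merged Bloom filters (assumed layout)."""

    def __init__(self, num_buckets: int, num_repetitions: int,
                 num_bits: int = 4096, num_hashes: int = 3):
        self.num_buckets = num_buckets
        self.num_repetitions = num_repetitions
        self.filters = [[BloomFilter(num_bits, num_hashes)
                         for _ in range(num_buckets)]
                        for _ in range(num_repetitions)]
        # members[r][b] = ids of the datasets merged into bucket b of repetition r
        self.members = [[set() for _ in range(num_buckets)]
                        for _ in range(num_repetitions)]

    def insert(self, dataset_id: str, dataset_kmers) -> None:
        dataset_kmers = list(dataset_kmers)  # allow generators
        for r in range(self.num_repetitions):
            # A different seed per repetition gives an independent partition.
            b = _hash(dataset_id, 1000 + r, self.num_buckets)
            self.members[r][b].add(dataset_id)
            for kmer in dataset_kmers:
                self.filters[r][b].add(kmer)

    def query(self, kmer: str) -> set:
        result = None
        for r in range(self.num_repetitions):
            candidates = set()
            for b in range(self.num_buckets):
                if kmer in self.filters[r][b]:
                    candidates |= self.members[r][b]
            # A dataset containing the k-mer is a candidate in every repetition,
            # so the intersection can never drop it.
            result = candidates if result is None else result & candidates
        return result if result is not None else set()


index = RamboSketch(num_buckets=4, num_repetitions=3)
index.insert("genome_A", kmers("ACGTACGTGA", 4))
index.insert("genome_B", kmers("TTGACCAGTA", 4))
print(index.query("ACGT"))
```

Because a Bloom filter never reports a stored item as absent, a dataset that truly contains the queried k-mer survives the intersection in every repetition. False positives remain possible, both from the filters themselves and from datasets that happen to share a bucket with a true match in all R repetitions; increasing R makes the latter less likely.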

Abstract

DNA sequencing, especially of microbial genomes and metagenomes, has been at the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and has shown no sign of slowing down. Several recent attempts have been made to tame the computational burden of sequence search on these terabyte- and petabyte-scale datasets, including raw reads and assembled genomes. However, no known implementation provides fast query and construction times, meets the low false-positive requirement, and offers cheap storage of the data structure all at once. We propose a data structure for search called RAMBO (Repeated And Merged BloOm Filter), which is significantly faster in query time than state-of-the-art genome indexing methods: COBS (Compact bit-sliced signature index), Sequence Bloom Trees, HowDeSBT, and SSBT.…

Code & Models

Repositories

gaurav16gupta/rambo_msmt (official)