Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Jason Pell; Arend Hintze; Rosangela Canino-Koning; Adina Howe; James M. Tiedje; C. Titus Brown

arXiv:1112.4193 · q-bio.GN · June 3, 2015 · Proc. Natl. Acad. Sci. USA

TL;DR

This paper introduces a memory-efficient, probabilistic representation of assembly graphs for metagenome assembly, built on Bloom filters. It stores a graph in as little as 4 bits per k-mer, albeit inexactly, which lowers the memory needed to analyze complex microbial communities.

Contribution

The paper presents a probabilistic, Bloom filter-based representation of DNA assembly graphs and uses it to partition those graphs into components before assembly, which reduces the memory required for de novo metagenome assembly.

Findings

01. Achieves up to 40-fold reduction in memory usage for metagenome assembly

02. Accurately represents DNA assembly graphs with low memory footprint

03. Enables analysis of complex microbial ecosystems with limited computational resources
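
The low memory footprint comes at the price of false positives: a Bloom filter can report a k-mer as present when it was never stored. As a rough illustration, the sketch below evaluates the textbook false-positive approximation for a standard Bloom filter at a given number of bits per stored item. The function names are hypothetical, and the formula is the generic one, which assumes a single bit array with independent hash functions; the structure used in the paper may be laid out differently, so treat the numbers as indicative only.

```python
import math


def bloom_false_positive_rate(bits_per_item, num_hashes):
    """Textbook approximation p = (1 - exp(-h / b)) ** h for a standard
    Bloom filter with b bits per stored item and h hash functions."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


def best_num_hashes(bits_per_item):
    """Pick the integer hash count with the lowest approximate error rate.

    The real-valued optimum is b * ln 2, so only the two neighbouring
    integers need to be compared.
    """
    ideal = bits_per_item * math.log(2)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda h: bloom_false_positive_rate(bits_per_item, h))


if __name__ == "__main__":
    # 4 bits per k-mer is the figure quoted in the abstract.
    for bits in (4, 8, 16):
        h = best_num_hashes(bits)
        p = bloom_false_positive_rate(bits, h)
        print(f"{bits} bits/item, {h} hashes: p is about {p:.4f}")
```

Under this approximation, a filter at 4 bits per item has a false-positive rate of roughly 15%, which is why the graph is stored "inexactly": some k-mers and edges appear that were never in the reads.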

Abstract

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of…
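
To make the idea of a Bloom filter-backed assembly graph concrete, here is a minimal sketch, not the authors' implementation. It stores only the k-mers (the nodes); edges are implicit and are recovered by asking the filter whether each of the possible neighbouring k-mers is present. The class and function names, the use of salted BLAKE2b as the hash family, and the choice to ignore reverse complements are all simplifying assumptions made for illustration.

```python
import hashlib


class KmerBloomFilter:
    """Illustrative Bloom filter over k-mer strings (hypothetical helper)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True for every stored k-mer; may also be True for others
        # (false positives), never False for a stored one.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(kmer)
        )


def load_reads(reads, k, bloom):
    """Insert every k-mer of every read into the filter."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])


def neighbors(kmer, bloom):
    """Implicit edges: k-mers overlapping `kmer` by k-1 bases that the
    filter reports as present. Reverse complements are ignored here."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in bloom:
                found.add(candidate)
    return sorted(found)


if __name__ == "__main__":
    bloom = KmerBloomFilter(num_bits=4096, num_hashes=3)
    load_reads(["ACGTTGCA"], k=4, bloom=bloom)
    print(neighbors("ACGT", bloom))
```

Because no edge list is kept, memory depends only on the size of the bit array. A false positive in the filter shows up as a spurious node or edge during traversal, which is the inexactness the abstract refers to.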
