Succinct Data Structures for Assembling Large Genomes
Thomas C Conway, Andrew J Bromage

TL;DR
This paper introduces a space-efficient, entropy-compressed data structure for de novo genome assembly graphs, significantly reducing memory requirements and improving scalability for large eukaryotic genomes.
Contribution
The authors develop a practical, succinct data structure for de Bruijn graphs that reduces storage by at least a factor of 10 and scales better with sequencing errors.
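To make the idea concrete, here is a minimal sketch of a de Bruijn graph stored as a set of edges with a rank query. It is an illustration under stated assumptions, not the authors' implementation: nodes are taken to be k-mers and edges (k+1)-mers packed at 2 bits per base, and a sorted Python list stands in for the entropy-compressed bitmap (conceptually one bit per possible (k+1)-mer). The names `encode`, `decode`, and `EdgeSet` are hypothetical.

```python
from bisect import bisect_left

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}

def encode(kmer):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode(value, length):
    """Inverse of encode for a string of the given length."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )

class EdgeSet:
    """Sorted list of encoded edges; a stand-in (assumption) for a
    sparse bitmap that supports rank queries."""

    def __init__(self, reads, k):
        self.k = k  # node length; each edge is a (k+1)-mer
        edges = set()
        for read in reads:
            for i in range(len(read) - k):
                edges.add(encode(read[i:i + k + 1]))
        self.edges = sorted(edges)

    def rank(self, value):
        """Number of stored edges strictly less than value."""
        return bisect_left(self.edges, value)

    def contains(self, value):
        i = self.rank(value)
        return i < len(self.edges) and self.edges[i] == value

    def successors(self, node):
        """Nodes reachable from `node` (a k-mer string) by one edge."""
        base = encode(node) << 2          # smallest edge starting with node
        lo = self.rank(base)
        hi = self.rank(base + 4)          # one past the largest such edge
        mask = (1 << (2 * self.k)) - 1    # keeps the last k bases of an edge
        return [decode(e & mask, self.k) for e in self.edges[lo:hi]]

graph = EdgeSet(["ACGTACGA"], k=3)
print(graph.successors("ACG"))
```

Because all edges leaving a node share that node as a prefix, they occupy a contiguous range in sorted order, so two rank queries are enough to find them. A real succinct structure would answer the same rank queries over a compressed bitmap rather than a plain list.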
Findings
Requires only 23 GB to store the human genome assembly graph.
Uses at least 10 times less storage than traditional methods.
Shows improved scalability with sequencing errors.
Abstract
Motivation: Second generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide scale biological variation, but within human bio-medicine, it offers a direct way of observing both large scale structural variation and fine scale sequence variation. Unfortunately, improvements in the computational feasibility for de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem, and the in-practice memory requirements of tools. Results: In this paper we use entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10…
Taxonomy
Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Protist diversity and phylogeny
