TL;DR
TwoPaCo is a scalable, low-memory algorithm that efficiently constructs compacted de Bruijn graphs from large sets of complete genomes, enabling advanced genomic analyses.
Contribution
It introduces a novel algorithm capable of building the compacted de Bruijn graph from many large genomes quickly and with low memory usage.
Findings
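Constructs the graph for 100 simulated human genomes in less than a day on a typical shared-memory machine.
Builds the graph for eight real primate genomes in under two hours on the same class of machine.
Is expected by the authors to enable analyses of hundreds of mammalian-sized genomes.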
Abstract
Motivation: De Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this paper, we present TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in less than two hours, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: [email protected]
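To make the term "compacted de Bruijn graph" concrete, here is a minimal sketch of what such a graph contains for a few short strings. It is a naive illustration, not TwoPaCo's algorithm: it holds every k-mer in memory, ignores reverse complements, and the function names `junction_kmers` and `compacted_edges` are hypothetical. A k-mer is treated as a junction if it starts or ends a sequence or has more than one distinct predecessor or successor; each compacted edge is the substring between two consecutive junctions.

```python
from collections import defaultdict


def junction_kmers(genomes, k):
    """Return k-mers that are branch points or sequence ends (forward strand only)."""
    succ = defaultdict(set)  # k-mer -> characters observed right after it
    pred = defaultdict(set)  # k-mer -> characters observed right before it
    ends = set()             # first and last k-mer of every sequence
    for g in genomes:
        if len(g) < k:
            continue
        ends.add(g[:k])
        ends.add(g[-k:])
        for i in range(len(g) - k + 1):
            kmer = g[i:i + k]
            if i > 0:
                pred[kmer].add(g[i - 1])
            if i + k < len(g):
                succ[kmer].add(g[i + k])
    kmers = set(succ) | set(pred) | ends
    return {
        v for v in kmers
        if v in ends or len(succ[v]) > 1 or len(pred[v]) > 1
    }


def compacted_edges(genomes, k):
    """Split each sequence at junction k-mers; each piece is one compacted edge."""
    junctions = junction_kmers(genomes, k)
    edges = set()
    for g in genomes:
        start = None  # position of the most recent junction in this sequence
        for i in range(len(g) - k + 1):
            if g[i:i + k] in junctions:
                if start is not None:
                    # The edge spans from the previous junction through this one.
                    edges.add(g[start:i + k])
                start = i
    return junctions, edges


if __name__ == "__main__":
    toy = ["ACGTACC", "ACGTTCC"]  # two toy "genomes" that diverge after ACGT
    vertices, paths = compacted_edges(toy, k=3)
    print(sorted(vertices))
    print(sorted(paths))
```

In this toy input the two sequences share the prefix `ACGT` and then diverge, so the shared prefix collapses into a single edge while each divergent suffix becomes its own edge. Scaling the same idea to many mammalian-sized genomes is the problem the paper addresses.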
