Simulating the DNA String Graph in Succinct Space
Diego Díaz-Domínguez, Travis Gagie, Gonzalo Navarro

TL;DR
This paper introduces rBOSS, a new succinct data structure that simulates the DNA string graph, enabling efficient genome assembly with improved contig sizes compared to traditional de Bruijn graphs.
Contribution
We propose rBOSS, a versatile data structure that simulates string graphs using succinct space, adaptable to various biological analyses and capable of efficient genome assembly.
Findings
rBOSS assembles 185 MB of reads in under 15 minutes
Produces contigs averaging over 10,000 base pairs
Outperforms fixed-length de Bruijn graphs in contig size
Abstract
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure we call rBOSS, which gets close to that ideal. Our rBOSS is a de Bruijn graph in practice, but it simulates any length up to k and can compute overlaps of size at least m between the labels of the nodes, with k and m being parameters. If we choose the parameter k equal to the size of the reads, then we can simulate a complete string graph. As most BWT-based structures, rBOSS is unidirectional, but it exploits the…
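The two classical structures the abstract contrasts can be illustrated with a small Python sketch. This is not rBOSS and not the authors' implementation: it uses plain dictionaries rather than a succinct BWT-based encoding, and the function names (`de_bruijn_graph`, `longest_overlap`, `overlap_edges`) are hypothetical, chosen only for illustration. It assumes reads are plain strings over the DNA alphabet, with `k` the k-mer length and `m` the minimum overlap, mirroring the parameters named in the abstract.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix, both of length k-1.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def longest_overlap(a, b, m):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than m."""
    for length in range(min(len(a), len(b)), m - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_edges(reads, m):
    """Naive all-pairs overlap edges (i, j, length) for overlaps of at least m."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = longest_overlap(a, b, m)
                if length:
                    edges.append((i, j, length))
    return edges


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "ACGGTT"]
    print(de_bruijn_graph(reads, 3))
    print(overlap_edges(reads, 3))
```

The quadratic all-pairs loop is only for clarity. A full string graph would additionally discard contained reads and transitive edges, which this sketch does not attempt.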
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · DNA and Biological Computing
