A grammar compressor for collections of reads with applications to the construction of the BWT

Diego Díaz-Domínguez; Gonzalo Navarro

arXiv:2011.07999·cs.DS·November 17, 2020

TL;DR

This paper introduces a grammar-based compression method for DNA sequencing reads that enables direct computation of the BWT, facilitating efficient genomic analyses in limited space.

Contribution

The authors present a grammar compression technique for read collections from which the BWT can be computed directly. On this kind of data it achieves, on average, 12% extra compression over other popular grammars while requiring less working space and time.

Findings

01 Achieves space reduction competitive with LZ-based methods

02 Achieves better space reduction than entropy-based approaches

03 Requires less working space and time than other popular grammars in the experiments

Abstract

We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform in succinct space genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar, but when required, compute its BWT to carry out the analysis by using self-indexes. Our experiments in real data showed that the space reduction we achieve with our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, in this kind of data, we achieve, on average, 12% of extra compression and require less working space and time.
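
As a rough illustration of the structure the abstract targets, the BWT of a small text can be computed naively by sorting all of its rotations and reading off the last column. This is not the authors' grammar-based construction, which avoids materializing the rotations; it is only a minimal sketch of what a BWT is. The function name naive_bwt, the "$" sentinel, and the "#" read separator are assumptions made for this example.

```python
def naive_bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of `text` by sorting its rotations.

    Quadratic-space textbook definition, suitable for tiny inputs only.
    The sentinel is assumed to be absent from `text` and to sort before
    every other symbol (true for "$" versus uppercase DNA letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Build every rotation of s, sort them lexicographically,
    # and take the last character of each sorted rotation.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


# Classic single-string example.
print(naive_bwt("banana"))  # annb$aa

# A toy "collection of reads", joined with an assumed "#" separator.
reads = ["ACGT", "ACGA"]
print(naive_bwt("#".join(reads)))
```

The output of the transform is always a permutation of the input plus the sentinel, which is why runs of equal symbols in the BWT of repetitive read collections compress well.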

Code & Models

Repositories

https://bitbucket.org/DiegoDiazDominguez/lms_grammar (official)