TL;DR
This paper introduces a grammar-based compression method for DNA sequencing reads that enables direct computation of the BWT, facilitating efficient genomic analyses in limited space.
Contribution
The authors present a grammar compression technique for read collections that allows direct BWT computation. On sequencing reads it achieves, on average, 12% extra compression over other popular grammar compressors while requiring less working space and time.
Findings
Achieves space reduction competitive with LZ-based methods
Compresses better than entropy-based approaches
Achieves, on average, 12% extra compression over other popular grammars, with less working space and time
Abstract
We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform in succinct space genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar, but when required, compute its BWT to carry out the analysis by using self-indexes. Our experiments on real data showed that the space reduction we achieve with our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, in this kind of data, we achieve, on average, 12% of extra compression and require less working space and time.
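For readers unfamiliar with the BWT, the following is a minimal sketch of the transform itself, computed naively by sorting all rotations of a single string. It is not the paper's grammar-based algorithm, which computes the BWT directly from the compressed representation. The function names `bwt` and `inverse_bwt` and the `$` sentinel are illustrative assumptions.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Illustration only: quadratic space, and not the grammar-based
    method described in the abstract. The sentinel is assumed to be
    absent from `text` and to sort before every other symbol.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the transform by repeatedly prepending the BWT and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    read = "GATTACA"
    transformed = bwt(read)
    print(transformed)
    print(inverse_bwt(transformed) == read)
```

The transform tends to group equal symbols that share a right context into runs, which is what self-indexes built on the BWT exploit to answer string queries.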
