RLZAP: Relative Lempel-Ziv with Adaptive Pointers
Anthony J. Cox, Andrea Farruggia, Travis Gagie, Simon J. Puglisi, and Jouni Sirén

TL;DR
RLZAP extends Relative Lempel-Ziv (RLZ) compression of genome databases so that it also handles short insertions, deletions, and multi-character substitutions, improving compression over earlier RLZ variants while keeping fast random access.
Contribution
It generalizes Ferrada et al.'s relative-pointer variant of RLZ to better handle short insertions, deletions, and multi-character substitutions, achieving improved compression performance.
Findings
Better compression than previous RLZ variants.
Maintains comparable random access times.
Effectively handles diverse genetic variations.
Abstract
Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show…
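The prior schemes the abstract builds on can be illustrated with a minimal Python sketch. It is not the RLZAP algorithm itself and not any of the cited authors' implementations. It only shows greedy parsing against a reference, with each phrase ending in an explicit character, and relative pointers computed as reference position minus target position. The function names (rlz_parse, rlz_decode, relative_pointers) and the example strings are hypothetical. The longest-match search is naive and quadratic; a practical implementation would use an index over the reference.

```python
def rlz_parse(reference, target):
    """Greedily parse target into (ref_pos, length, char) phrases.

    Each phrase copies `length` characters from reference[ref_pos:] and
    then appends one explicit character (the "mismatch" character).
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        for p in range(len(reference)):
            l = 0
            # Stop one character early so every phrase can end
            # with an explicit character taken from the target.
            while (p + l < len(reference)
                   and i + l < len(target) - 1
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    out = []
    for pos, length, ch in phrases:
        out.append(reference[pos:pos + length])
        out.append(ch)
    return "".join(out)


def relative_pointers(phrases):
    """Return ref_pos minus target_pos for each phrase.

    Phrases separated only by substitutions share the same value,
    which is why run-length compressing these pointers pays off.
    """
    rel = []
    t = 0  # start of the current phrase in the target
    for pos, length, _ in phrases:
        rel.append(pos - t)
        t += length + 1
    return rel


# Hypothetical example: the target differs from the reference by one substitution.
reference = "ACGTACGTTTGACCA"
target = "ACGTACCTTTGACCA"
phrases = rlz_parse(reference, target)
restored = rlz_decode(reference, phrases)
pointers = relative_pointers(phrases)
```

In this example both phrases have the same relative pointer, because a substitution does not shift the alignment between target and reference. An insertion or deletion does shift it, so the relative pointer changes afterwards; that is the case the abstract says RLZAP is designed to handle better.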
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Plant nutrient uptake and metabolism
