Hierarchical Relative Lempel-Ziv Compression
Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow

TL;DR
This paper introduces a hierarchical RLZ compression method that uses multiple references organized in a tree structure, significantly improving compression efficiency for genomic datasets with minimal impact on decompression speed.
Contribution
It proposes a novel hierarchical approach for RLZ compression using a tree of references, enhancing compression of similar string collections like genomes.
Findings
Twofold compression improvement on bacterial genomes
Negligible increase in decompression time
Effective hierarchy construction using sparse graphs and locality-sensitive hashing
Abstract
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string is compressed relative to a second string (called the reference) by parsing it into a sequence of substrings that occur in the reference. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such data sets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with its parent as reference, storing only the root of the…
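The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (rlz_parse, rlz_decode, compress_hierarchy, decompress), the phrase encoding as (position, length) pairs with single-character literals, and the naive quadratic matching via str.find are all assumptions made for illustration, and the tree is assumed to be given as a child-to-parent mapping rather than constructed. The sketch also assumes the root string is kept uncompressed. A practical implementation would use a suffix array or similar index over the reference.

```python
def rlz_parse(s, ref):
    """Greedy RLZ parse of s relative to ref (naive matching, illustration only).

    Returns a list of phrases: (pos, length) for a substring of ref,
    or a single-character string for a symbol that does not occur in ref.
    """
    phrases = []
    i = 0
    while i < len(s):
        pos, length = -1, 0
        # Extend the match while s[i:i+length+1] still occurs somewhere in ref.
        while i + length < len(s):
            p = ref.find(s[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        if length == 0:
            phrases.append(s[i])          # literal: symbol absent from ref
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decode(phrases, ref):
    """Rebuild a string from its phrases and the reference."""
    out = []
    for ph in phrases:
        if isinstance(ph, str):
            out.append(ph)
        else:
            pos, length = ph
            out.append(ref[pos:pos + length])
    return "".join(out)


def compress_hierarchy(strings, parent, root):
    """Compress every non-root string relative to its parent in the tree.

    strings: name -> text, parent: child name -> parent name (hypothetical layout).
    Returns the root text (assumed stored as-is) and name -> phrase list.
    """
    encoded = {
        name: rlz_parse(text, strings[parent[name]])
        for name, text in strings.items()
        if name != root
    }
    return strings[root], encoded


def decompress(name, root, root_text, encoded, parent, cache=None):
    """Decode one string by first decoding its ancestors up to the root."""
    if cache is None:
        cache = {root: root_text}
    if name not in cache:
        ref = decompress(parent[name], root, root_text, encoded, parent, cache)
        cache[name] = rlz_decode(encoded[name], ref)
    return cache[name]
```

In this sketch, decoding a string requires decoding every ancestor on its path to the root, so the depth of the tree directly affects decompression work; the cache avoids repeating that work when several strings share ancestors.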
