A Compressed Self-Index for Genomic Databases
Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi

TL;DR
This paper introduces a compressed self-index for genomic databases that leverages relative Lempel-Ziv compression to efficiently store, search, and access large collections of similar genomes.
Contribution
It extends relative Lempel-Ziv (RLZ) compression, which already offers good compression and fast random access, with fast search, yielding an efficient compressed self-index for large genomic datasets.
Findings
Achieves good compression ratios for genomic data
Supports fast random access to stored genomes
Enables efficient search within compressed genomic collections
Abstract
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
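To make the RLZ idea in the abstract concrete, the following is a minimal sketch, not the paper's data structure. It assumes a naive quadratic longest-match search in place of an FM-index or suffix array on the reference, and it covers only parsing, decoding, and random access; the search support that the paper contributes is not shown. The function names (`rlz_parse`, `rlz_decode`, `phrase_starts`, `rlz_access`) and the phrase layout `(ref_pos, length, literal)` are hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def rlz_parse(reference, target):
    """Greedily parse `target` into phrases copied from `reference`.

    Each phrase is (ref_pos, length, literal). A character that does not
    occur in the reference becomes a literal phrase (-1, 1, char).
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Naive longest-match search (assumption: a real implementation
        # would query an index built on the reference instead).
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append((-1, 1, target[i]))
            i += 1
        else:
            phrases.append((best_pos, best_len, None))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target by copying each phrase out of the reference."""
    parts = []
    for pos, length, literal in phrases:
        parts.append(literal if literal is not None
                     else reference[pos:pos + length])
    return "".join(parts)


def phrase_starts(phrases):
    """Starting offset of every phrase within the decoded target."""
    starts, offset = [], 0
    for _, length, _ in phrases:
        starts.append(offset)
        offset += length
    return starts


def rlz_access(reference, phrases, starts, i):
    """Return target[i] without decoding the whole target."""
    k = bisect_right(starts, i) - 1  # phrase that covers position i
    pos, _, literal = phrases[k]
    if literal is not None:
        return literal
    return reference[pos + (i - starts[k])]
```

For a reference `"ACGTACGTTAGC"` and a target `"ACGTTCGTTAGCN"`, the greedy parse yields two copied phrases and one literal for the `N`. The binary search over phrase start offsets is what makes random access cheap in this sketch: a lookup touches one phrase and one reference position rather than the whole genome.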
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Fractal and DNA sequence analysis
