A Compressed Self-Index for Genomic Databases

Travis Gagie; Juha Kärkkäinen; Yakov Nekrich; Simon J. Puglisi

arXiv:1111.1355·cs.DS·November 8, 2011·1 cites

Open Access

TL;DR

This paper introduces a compressed self-index for genomic databases that leverages relative Lempel-Ziv compression to efficiently store, search, and access large collections of similar genomes.

Contribution

It extends RLZ compression by enabling fast search capabilities, creating an efficient self-index for large genomic datasets.

Findings

01 Achieves good compression ratios for genomic data

02 Supports fast random access to stored genomes

03 Enables efficient search within compressed genomic collections

Abstract

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv (RLZ) compression takes advantage of this property: it stores the first genome uncompressed or as an FM-index, then compresses the other genomes with a variant of LZ77 that copies phrases only from the first genome. RLZ achieves good compression and supports fast random access; in this paper we show how to support fast search as well, thus obtaining an efficient compressed self-index.
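
To make the RLZ idea in the abstract concrete, here is a minimal sketch in Python, not the authors' implementation. It greedily factors a target genome into phrases copied only from a reference genome, decodes them, and reads a single character without decoding everything. The function names (`rlz_parse`, `rlz_decode`, `phrase_starts`, `rlz_access`) and the literal encoding `(char, 0)` for characters absent from the reference are assumptions made for illustration. The longest-match step is a naive scan standing in for the suffix array or FM-index a practical implementation would use.

```python
from bisect import bisect_right


def rlz_parse(reference, target):
    """Greedily factor target into (ref_pos, length) phrases copied from reference.

    Assumption: a character that never occurs in the reference is stored
    as a literal phrase (char, 0).
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Naive longest-match scan (quadratic); an index would replace this loop.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append((target[i], 0))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the target by copying each phrase out of the reference."""
    return "".join(
        src if length == 0 else reference[src:src + length]
        for src, length in phrases
    )


def phrase_starts(phrases):
    """Starting offset in the target of every phrase (literals have width 1)."""
    starts, pos = [], 0
    for _, length in phrases:
        starts.append(pos)
        pos += max(length, 1)
    return starts


def rlz_access(reference, phrases, starts, i):
    """Return target[i] by binary-searching for the phrase that covers i."""
    k = bisect_right(starts, i) - 1
    src, length = phrases[k]
    return src if length == 0 else reference[src + (i - starts[k])]


reference = "ACGTACGTTTGA"
target = "ACGTTTGAACGNAC"
phrases = rlz_parse(reference, target)
starts = phrase_starts(phrases)
print(phrases)
print(rlz_decode(reference, phrases) == target)
print(rlz_access(reference, phrases, starts, 11))
```

Because every phrase points into the reference only, and never into another compressed genome, a lookup needs one binary search over phrase boundaries plus one read from the reference. This is the property that makes random access fast in this kind of scheme.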

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Fractal and DNA sequence analysis