Reference Sequence Construction for Relative Compression of Genomes

Shanika Kuruppu; Simon Puglisi; Justin Zobel

arXiv:1106.3791·q-bio.QM·June 21, 2011

Open Access

TL;DR

This paper investigates using the repeat dictionaries generated by the Comrad, Re-pair, and Dna-x algorithms as reference sequences for relative compression of genomic data, improving compression while maintaining fast random access.

Contribution

It proposes using repeat dictionaries as reference sequences, which improves relative compression of repetitive genomic datasets.

Findings

01 Better compression achieved with repeat dictionaries.

02 Supports rapid random access to compressed data.

03 Applicable to diverse repetitive datasets.

Abstract

Relative compression, where a set of similar strings is compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In this paper, we explore using the dictionaries of repeats generated by the Comrad, Re-pair and Dna-x algorithms as reference sequences for relative compression. We show that this technique allows better compression and supports random access just as well. The technique also allows more general repetitive datasets to be compressed using relative compression.
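To make the idea of compressing a string "with respect to a reference string" concrete, here is a minimal sketch of one common way to do it: greedy longest-match factorization. It is not taken from the paper. The function names `relative_compress` and `relative_decompress`, the factor layout, and the toy sequences are all assumptions made for illustration, and the naive substring search stands in for the index structure a practical implementation would likely use.

```python
from typing import List, Tuple, Union

# A factor is either a (position, length) copy from the reference,
# or a single literal character that does not occur in the reference.
Factor = Union[Tuple[int, int], str]


def relative_compress(reference: str, target: str) -> List[Factor]:
    """Greedily factorize `target` into copies from `reference`.

    Illustrative only: uses a naive substring search, so it is slow
    on real genomes.
    """
    factors: List[Factor] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        length = 1
        # Extend the match one character at a time until it no longer
        # occurs anywhere in the reference.
        while i + length <= len(target):
            pos = reference.find(target[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            # Character absent from the reference: emit it as a literal.
            factors.append(target[i])
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def relative_decompress(reference: str, factors: List[Factor]) -> str:
    """Rebuild the target by copying each factor out of the reference."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(reference[pos:pos + length])
    return "".join(parts)


# Hypothetical toy data, not from the paper.
reference = "ACGTACGTTTGA"
target = "ACGTTTGAACGN"
factors = relative_compress(reference, target)
print(factors)  # [(4, 8), (0, 3), 'N']
assert relative_decompress(reference, factors) == target
```

In this sketch the reference is an arbitrary string, so a dictionary of repeats could be passed in its place. The fewer and longer the factors, the better the compression, which is why the choice of reference matters. Random access would follow from storing the cumulative factor lengths and searching them for the factor that covers a requested position; that step is omitted here.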

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Genomics and Phylogenetic Studies