Suffix Arrays for Spaced-SNP Databases

Travis Gagie

arXiv:1407.0114 · cs.DS · July 2, 2014

TL;DR

This paper shows how to store a fast compressed suffix array for a database of genomes that differ only by SNPs, provided the substrings between those SNPs are unique.

Contribution

The paper shows how to compress and index a database of genomes that differ only by a reasonable number of SNPs, exploiting the uniqueness of the substrings between those SNPs to store a fast compressed suffix array.

Findings

01

Shows that a fast compressed suffix array can be stored for a database of genomes that differ only by SNPs

02

Reduces storage requirements for genomic data with SNP variations

03

Keeps queries fast on databases of genomes that differ by a reasonable number of SNPs

Abstract

Single-nucleotide polymorphisms (SNPs) account for most variations between human genomes. We show how, if the genomes in a database differ only by a reasonable number of SNPs and the substrings between those SNPs are unique, then we can store a fast compressed suffix array for that database.
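
To make the setting concrete, here is a minimal sketch of a plain (uncompressed) suffix array over a tiny database of two genomes that differ by a single SNP. It is not the compressed structure the paper describes; the naive sort-based construction, the `$` separator, and all names (`apply_snps`, `build_suffix_array`, `find_occurrences`, the toy reference string) are assumptions made purely for illustration.

```python
from bisect import bisect_left, bisect_right


def apply_snps(reference, snps):
    """Return a copy of `reference` with single-character substitutions applied.

    `snps` maps a 0-based position to the replacement base (hypothetical format).
    """
    chars = list(reference)
    for pos, base in snps.items():
        chars[pos] = base
    return "".join(chars)


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`.

    Because the suffixes are sorted, their prefixes of length len(pattern)
    are also in non-decreasing order, so binary search finds the matching range.
    """
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]  # materialized for clarity, not speed
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])


# Toy database: a reference and one variant with a single SNP at position 3.
reference = "ACGTTGCA"
genomes = [reference, apply_snps(reference, {3: "A"})]

# Concatenate the genomes, ending each with a separator that sorts before A/C/G/T.
text = "$".join(genomes) + "$"
sa = build_suffix_array(text)

print(find_occurrences(text, sa, "TGCA"))  # [4, 13]: shared by both genomes
print(find_occurrences(text, sa, "GTT"))   # [2]: only the reference
print(find_occurrences(text, sa, "GAT"))   # [11]: only the variant
```

Most of the two genomes is identical, so most suffixes of the variant closely mirror suffixes of the reference. That repetition is the kind of structure a compressed index for such a database can hope to exploit.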

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis