PFP Data Structures

Christina Boucher; Ondřej Cvacho; Travis Gagie; Jan Holub; Giovanni Manzini; Gonzalo Navarro; Massimiliano Rossi

arXiv:2006.11687 · cs.DS · June 23, 2020

TL;DR

This paper treats prefix-free parsing (PFP) as a data structure and augments it to support longest common extension (LCE), suffix array (SA), longest common prefix (LCP), and Burrows-Wheeler Transform (BWT) queries, with the aim of faster processing of large genomic datasets.

Contribution

It introduces a PFP data structure that supports fast queries and demonstrates its practical efficiency on large, repetitive genomic data.

Findings

01 Supports LCE, SA, LCP, and BWT queries in $O(|\mathrm{PFP}(S)|)$ space, that is, space linear in the combined size of the dictionary and parse

02 Constructs efficiently for large genomic datasets

03 Achieves near real-time query performance
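For readers unfamiliar with the four queries named above, the following brute-force reference shows what each one returns on a plain string. It is only a sketch of the query semantics, not the paper's data structure: it works on the uncompressed text and takes far more time and space than a PFP-based index would. The function names are illustrative, and the text is assumed to end with a unique terminator `$` that sorts before every other character.

```python
def suffix_array(s):
    """SA[k] = start position of the k-th smallest suffix of s (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lce(s, i, j):
    """Longest common extension: length of the longest common prefix
    of the suffixes s[i:] and s[j:]."""
    n = 0
    while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n]:
        n += 1
    return n


def lcp_array(s, sa):
    """LCP[k] = LCE of the suffixes at SA[k-1] and SA[k]; LCP[0] = 0."""
    return [0] + [lce(s, sa[k - 1], sa[k]) for k in range(1, len(sa))]


def bwt(s, sa):
    """BWT[k] = character preceding the k-th smallest suffix (cyclically)."""
    return "".join(s[i - 1] for i in sa)


text = "banana$"
sa = suffix_array(text)
print(sa)                   # [6, 5, 3, 1, 0, 4, 2]
print(lcp_array(text, sa))  # [0, 0, 1, 3, 0, 0, 2]
print(bwt(text, sa))        # annb$aa
print(lce(text, 1, 3))      # 3
```

The point of an augmented PFP is to answer the same questions without materializing the full suffix array or the text itself.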

Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT}(S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP}(S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT}(S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP}(S)$ as a data structure and show how it can be augmented to support the following queries quickly, still in $O(|\mathrm{PFP}(S)|)$ space: longest common extension (LCE), suffix array (SA), longest common…
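To make the dictionary and parse concrete, here is a minimal sketch of how a string might be split into overlapping phrases. It is an illustration under stated assumptions, not the authors' implementation: the window length `w`, the modulus `p`, the toy window hash (a sum of character codes, where a real implementation would typically use a rolling hash), and the sentinel characters `\x01` and `\x02` are all chosen here for the example. A phrase ends wherever a length-`w` window hashes to 0 modulo `p`, and the next phrase starts at that same window, so consecutive phrases overlap by `w` characters.

```python
def prefix_free_parse(s, w=2, p=3):
    """Split s into overlapping phrases; return (dictionary, parse).

    Assumptions for this sketch: '\x01' and '\x02' do not occur in s,
    and the window hash is a toy stand-in for a rolling hash.
    """
    text = "\x01" + s + "\x02" * w  # start sentinel, w end sentinels

    def is_trigger(window):
        return sum(map(ord, window)) % p == 0  # toy hash, illustrative only

    phrases = []
    start = 0
    for i in range(1, len(text) - w + 1):
        at_end = i + w == len(text)  # the final window always closes a phrase
        if at_end or is_trigger(text[i:i + w]):
            phrases.append(text[start:i + w])  # phrase ends with this window
            start = i                          # next phrase starts with it
    dictionary = sorted(set(phrases))          # D: distinct phrases, sorted
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]       # P: phrase ranks in text order
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Invert the parse: drop the w-character overlap, then the sentinels."""
    phrases = [dictionary[r] for r in parse]
    text = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return text[1:len(text) - w]


genomes = "GATTACAGATTACAGATTACA" * 3
D, P = prefix_free_parse(genomes)
print(len(D), len(P))
assert reconstruct(D, P) == genomes
```

On repetitive input the same phrases recur, so the dictionary stays small and the repetition is pushed into the parse, which is the intuition behind $D$ and $P$ together being much smaller than $S$ for collections of similar genomes.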

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies