PFP Data Structures
Christina Boucher, Ondřej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, Massimiliano Rossi

TL;DR
This paper enhances the prefix-free parsing (PFP) data structure to efficiently support key string queries like LCE, SA, LCP, and BWT, enabling faster processing of large genomic datasets.
Contribution
It introduces a PFP data structure that supports fast queries and demonstrates its practical efficiency on large, repetitive genomic data.
Findings
Supports LCE, SA, LCP, and BWT queries in space linear in the size of the PFP
Can be built efficiently for large genomic datasets
Achieves near real-time query performance
Abstract
Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT}(S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP}(S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT}(S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP}(S)$ as a data structure and show how it can be augmented to support the following queries quickly, still in $O(|\mathrm{PFP}(S)|)$ space: longest common extension (LCE), suffix array (SA), longest common…
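To make the dictionary and parse of overlapping phrases concrete, here is a minimal Python sketch of the parsing step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window length `w`, the modulus `p`, and the toy hash (sum of code points) are all hypothetical choices, and the sentinel padding that the original construction relies on to guarantee a prefix-free dictionary is omitted for brevity.

```python
def is_trigger(window: str, p: int) -> bool:
    # Toy stand-in for a rolling hash (assumption): a window is a
    # "trigger string" when the sum of its code points is 0 modulo p.
    return sum(map(ord, window)) % p == 0


def prefix_free_parse(s: str, w: int = 2, p: int = 3):
    """Split s into phrases that overlap by w characters.

    Returns (dictionary, parse): the sorted list of distinct phrases and
    the sequence of phrase ranks in text order.
    """
    phrases = []
    start = 0
    # Slide a window of length w; start at 1 so the first phrase is not
    # cut at its own leading window.
    for i in range(1, len(s) - w + 1):
        if is_trigger(s[i:i + w], p):
            phrases.append(s[start:i + w])  # phrase ends with the trigger
            start = i                       # next phrase starts with it
    phrases.append(s[start:])               # final phrase runs to the end

    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int) -> str:
    # Concatenate phrases, dropping the w-character overlap that every
    # phrase after the first shares with its predecessor.
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out


if __name__ == "__main__":
    text = "GATTACAGATTACAGATTACA"
    d, pr = prefix_free_parse(text, w=2, p=3)
    print(d, pr)
    print(reconstruct(d, pr, 2) == text)
```

On repetitive inputs the same phrases recur, so the dictionary holds each one only once and the parse is a short list of integers; this is the intuition behind the dictionary and parse being much smaller than the input in practice.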
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies
