Prefix-free parsing for building large tunnelled Wheeler graphs

Adrián Goga; Andrej Baláž

arXiv:2206.15097·cs.DS·August 28, 2023

TL;DR

This paper introduces a novel method combining Wheeler graphs and prefix-free parsing to efficiently build compact, space-saving indexes for large, repetitive genomic datasets, enabling practical pangenomic references.

Contribution

It presents a new approach that uses prefix-free parsing to accelerate and reduce memory usage in constructing Wheeler graph-based indexes for large genomic collections.

Findings

01. Faster construction of Wheeler graphs with less memory.

02. Effective compression of large repetitive texts.

03. Enabling practical pangenomic indexing.

Abstract

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most…
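To make the prefix-free parsing step mentioned in the abstract more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation and not the code from the linked repository. The function names `prefix_free_parse` and `reconstruct`, the padding characters, the parameters `w` and `p`, and the simple polynomial hash are all assumptions made for illustration. Practical PFP implementations typically use a rolling Karp-Rabin hash instead of rehashing each window from scratch.

```python
def prefix_free_parse(text, w=2, p=3):
    """Toy prefix-free parsing: returns (dictionary, parse).

    Assumptions (illustrative only): w >= 1, and the characters chr(1)
    and chr(0) do not occur in `text`; they are used as padding.
    """
    padded = "\x01" + text + "\x00" * w
    n = len(padded)

    def is_trigger(i):
        # The final window (all padding) is always a trigger string.
        if i == n - w:
            return True
        # Plain polynomial hash of the window, recomputed each time for clarity.
        h = 0
        for c in padded[i:i + w]:
            h = (h * 256 + ord(c)) % 1000003
        return h % p == 0

    phrases = []
    start = 0  # the first phrase starts at the leading padding character
    for i in range(1, n - w + 1):
        if is_trigger(i):
            # A phrase runs from one trigger window to the end of the next,
            # so consecutive phrases overlap by exactly w characters.
            phrases.append(padded[start:i + w])
            start = i

    dictionary = sorted(set(phrases))           # distinct phrases, sorted
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]        # text as a sequence of ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Invert the toy parse: glue phrases back, dropping the w-char overlaps."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]  # strip the leading and trailing padding


if __name__ == "__main__":
    text = "GATTACAT!GATTACAT!GATTAGATA"
    dictionary, parse = prefix_free_parse(text, w=2, p=3)
    print(len(dictionary), "distinct phrases,", len(parse), "phrases in the parse")
    assert reconstruct(dictionary, parse, 2) == text
```

On repetitive inputs the same phrases recur, so the dictionary stays small relative to the text while the parse records only phrase ranks. This is the property that makes PFP attractive as a preprocessing step for BWT-based index construction.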

Code & Models

Repositories

fmfi-compbio/pfp_wg (official)
