Optimal-Time Mapping in Run-Length Compressed PBWT
Paola Bonizzoni, Davide Cozzi, Younan Gao

TL;DR
This paper introduces a new data structure for the multi-allelic PBWT that supports constant-time forward and backward steps within run-length encoded space, improving haplotype retrieval and prefix search efficiency.
Contribution
It presents the first $O(\tilde{r})$-word data structure supporting both forward and backward steps in the run-length encoded PBWT, where $\tilde{r}$ is the total number of runs, and extends its applicability to general alphabets.
Findings
Supports constant-time forward and backward stepping in run-length encoded PBWT.
Enables haplotype retrieval in $O\left(\frac{\tilde{r}}{\text{word}} \times \log\log_{\text{word}} h + \text{width}\right)$ time.
Provides efficient prefix search with $O(\text{height} + \tilde{r})$ space and $O(m' \times \log\log_{\text{word}} \sigma + \text{occ})$ query time.
Abstract
The Positional Burrows-Wheeler Transform (PBWT) is a data structure designed for efficiently representing and querying large collections of sequences, such as haplotype panels in genomics. Forward and backward stepping operations (analogous to LF- and FL-mapping in the traditional BWT) are fundamental to the PBWT and underpin many PBWT-based algorithms for haplotype matching and related analyses. Although the run-length encoded variant of the PBWT (also known as the $\mu$-PBWT) achieves $O(\tilde{r})$-word space usage, where $\tilde{r}$ is the total number of runs, no data structure supporting both forward and backward stepping in constant time within this space bound was previously known. In this paper, we consider the multi-allelic PBWT that is extended from its original binary form to a general ordered alphabet $\{0, \dots, \sigma-1\}$. We first establish bounds on the size…
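To make the forward and backward stepping operations concrete, here is a minimal Python sketch on a plain, uncompressed multi-allelic PBWT. It is not the paper's run-length compressed structure: each step here takes linear time in the number of haplotypes, whereas the paper's contribution is constant-time stepping in run-length compressed space. All function names (`build_pbwt`, `forward_step`, `backward_step`, `retrieve`, `count_runs`) are hypothetical and chosen for illustration only.

```python
def build_pbwt(haps):
    """Build prefix arrays and PBWT columns for equal-length sequences
    over an ordered alphabet (illustrative, uncompressed)."""
    h, w = len(haps), len(haps[0])
    order = list(range(h))  # column 0 keeps the input order
    prefix_arrays, columns = [], []
    for k in range(w):
        prefix_arrays.append(order)
        columns.append([haps[i][k] for i in order])
        # A stable sort by the symbol at column k yields the next ordering.
        order = sorted(order, key=lambda i: haps[i][k])
    prefix_arrays.append(order)
    return prefix_arrays, columns


def forward_step(column, i):
    """Map position i in a column to the same haplotype's position
    in the next column (the analogue of LF-mapping)."""
    c = column[i]
    smaller = sum(1 for x in column if x < c)     # symbols sorted before c
    rank = sum(1 for x in column[:i] if x == c)   # earlier copies of c
    return smaller + rank


def backward_step(column, j):
    """Inverse of forward_step: the position in this column that maps
    to position j in the next column (the analogue of FL-mapping)."""
    return sorted(range(len(column)), key=lambda i: column[i])[j]


def retrieve(columns, i):
    """Recover the i-th input haplotype by repeated forward steps."""
    out = []
    for col in columns:
        out.append(col[i])
        i = forward_step(col, i)
    return out


def count_runs(columns):
    """Total number of maximal equal-symbol runs over all columns."""
    return sum(1 + sum(a != b for a, b in zip(col, col[1:])) for col in columns)


haps = [[0, 1, 2], [1, 1, 0], [0, 2, 2], [1, 0, 0]]
prefix_arrays, columns = build_pbwt(haps)
print(retrieve(columns, 2), count_runs(columns))
```

The sketch relies on the fact that the ordering of one column is a stable sort of the previous ordering by symbol, so a forward step only needs a count of smaller symbols plus a rank within the column. A compressed structure would answer those two quantities per run rather than by scanning.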
Taxonomy
Topics: Algorithms and Data Compression · Genetic Associations and Epidemiology · Genomics and Phylogenetic Studies
