# Lossless Pangenome Indexing Using Tag Arrays

**Authors:** Parsa Eskandar, Benedict Paten, Jouni Sirén

PMC · DOI: 10.21203/rs.3.rs-8233501/v1 · Research Square · 2026-01-18

## TL;DR

This paper introduces tag arrays, which annotate positions in the Burrows–Wheeler transform (BWT) with graph coordinates, as a way to index pangenome graphs losslessly and to support haplotype-aware queries.

## Contribution

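The paper contributes a scalable indexing framework that extends the FM-index with a run-length compressed tag array, together with a memory-efficient construction algorithm that builds the index per chromosome and merges the results.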

## Key findings

- On the HPRC graphs, the tag array structure compresses effectively and scales well as haplotypes are added.
- The index preserves accurate mapping information across diverse regions of the genome.
- It supports efficient one-to-all coordinate translation: any interval on a haplotype can be mapped to its corresponding intervals on all other haplotypes in the graph.

## Abstract

Pangenome graphs represent genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due to the scale and complexity of pangenomic data. We present a practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows–Wheeler transform (BWT) with graph coordinates. Our method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations where a query pattern appears. We introduce a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal to compute the tag array in a memory-efficient manner. To support large genomes, we process each chromosome independently and then merge the results into a unified index using properties of the multi-string BWT and r-index. Our evaluation on the HPRC graphs demonstrates that the tag array structure compresses effectively, scales well with added haplotypes, and preserves accurate mapping information across diverse regions of the genome. This indexing method enables lossless and haplotype-aware querying in complex pangenomes and offers a practical indexing layer for developing scalable aligners and downstream graph-based analysis tools. The index additionally supports efficient one-to-all coordinate translation, enabling any interval on a haplotype to be mapped to its corresponding intervals across all other haplotypes in the graph.
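
To make the core idea concrete, here is a minimal Python sketch of a tag array on a toy graph. It is an illustration under stated assumptions, not the authors' implementation: the function names (`path_coords`, `build_index`, `locate_unique`), the `(node_id, offset)` form of a graph coordinate, and the example graph are all hypothetical. It also uses naive suffix sorting and naive rank queries in place of the k-mer based construction and compressed structures described in the abstract.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


def path_coords(path, node_seqs):
    """Spell a haplotype along a node path; tag each base with (node_id, offset)."""
    seq, coords = [], []
    for node in path:
        for offset, base in enumerate(node_seqs[node]):
            seq.append(base)
            coords.append((node, offset))
    return "".join(seq), coords


def build_index(haplotypes):
    """Build a toy multi-string BWT plus a run-length encoded tag array.

    Assumption: fewer than 65 haplotypes, so the terminators chr(0), chr(1), ...
    are distinct and sort below every base.
    """
    text, pos_tags = "", []
    for k, (seq, coords) in enumerate(haplotypes):
        text += seq + chr(k)          # distinct terminator per haplotype
        pos_tags += coords + [None]   # terminators carry no graph coordinate
    # Naive suffix array (quadratic memory; fine for a toy example only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the last terminator
    # Tag of BWT position j = graph coordinate where suffix sa[j] starts.
    runs = [(tag, len(list(g))) for tag, g in groupby(pos_tags[i] for i in sa)]
    run_starts, start = [], 0
    for _, length in runs:
        run_starts.append(start)
        start += length
    return {"bwt": bwt, "sorted": sorted(text), "runs": runs, "run_starts": run_starts}


def backward_search(index, pattern):
    """Return the half-open BWT range [lo, hi) of suffixes prefixed by pattern."""
    bwt, sorted_text = index["bwt"], index["sorted"]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        first = bisect_left(sorted_text, c)  # number of characters smaller than c
        lo = first + bwt[:lo].count(c)       # naive rank query
        hi = first + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate_unique(index, pattern):
    """Distinct graph coordinates where pattern occurs, read from the tag runs."""
    lo, hi = backward_search(index, pattern)
    if lo >= hi:
        return set()
    run_starts, runs = index["run_starts"], index["runs"]
    found = set()
    r = bisect_right(run_starts, lo) - 1     # run containing position lo
    while r < len(runs) and run_starts[r] < hi:
        found.add(runs[r][0])
        r += 1
    return found


# Hypothetical toy graph: a single-base variant (node 2 vs. node 3).
node_seqs = {1: "GA", 2: "T", 3: "C", 4: "TACA"}
haplotypes = [
    path_coords([1, 2, 4], node_seqs),  # GATTACA
    path_coords([1, 3, 4], node_seqs),  # GACTACA
    path_coords([1, 2, 4], node_seqs),  # GATTACA again
]
index = build_index(haplotypes)
print(sorted(locate_unique(index, "TACA")))  # [(4, 0)]
print(sorted(locate_unique(index, "AC")))    # [(1, 1), (4, 1)]
```

In this toy case `TACA` occurs three times across the haplotypes but at a single graph location, so the query returns one coordinate. Scanning runs rather than individual BWT positions is what lets the cost depend on the number of tag runs in the range instead of the number of occurrences.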

## Full-text entities

- **Genes:** FSHMD1A (facioscapulohumeral muscular dystrophy 1A) [NCBI Gene 2489] {aka FMD, FSHD, FSHD1A, FSHMD}
- **Chemicals:** TAG (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12869693/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12869693/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/PMC12869693/full.md

---
Source: https://tomesphere.com/paper/PMC12869693