The TAG array of a multiple sequence alignment

Jannik Olbrich; Enno Ohlebusch

arXiv:2511.19068 · q-bio.GN · November 25, 2025

Open Access

TL;DR

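This paper introduces a method for integrating a multiple sequence alignment (MSA) with a BWT-based pangenome index. Matches reported by the index can be mapped efficiently to MSA columns and projected onto a reference genome, which reduces the cost of downstream analysis when a match occurs at thousands of positions that all correspond to the same locus.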

Contribution

It presents a novel indexing approach that tags BWT entries with MSA columns, facilitating faster mapping and reference projection in pangenome analysis.

Findings

01. Efficient mapping of BWT matches to MSA columns.

02. Capability to project matches to a reference genome.

03. Improved downstream analysis efficiency.
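
To make the second finding concrete, here is a minimal Python sketch of projecting an MSA column onto a reference sequence. It is an illustration only, not the paper's data structure: the function name `project_to_reference`, the toy reference row, and the choice to return `None` when the reference has a gap in that column are all assumptions made for this example.

```python
def project_to_reference(reference_row, column):
    """Map an MSA column to a 0-based position in the ungapped reference.

    reference_row is the reference's row of the alignment, with '-' for gaps.
    Returns None if the reference has a gap in this column (an assumed
    policy for this sketch; other policies are possible).
    """
    if reference_row[column] == "-":
        return None
    # Every gap to the left of the column shifts the ungapped position by one.
    return column - reference_row.count("-", 0, column)


# Hypothetical reference row of a small alignment.
reference_row = "ACG--ACGGA"

for column in (0, 3, 5, 9):
    print(column, project_to_reference(reference_row, column))
```

Counting gaps on every call takes time linear in the column index. A real index would precompute this mapping or use a rank structure, which is outside the scope of this sketch.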

Abstract

Modern genomic analyses increasingly rely on pangenomes, that is, representations of the genome of entire populations. The simplest representation of a pangenome is a set of individual genome sequences. Compared to e.g. sequence graphs, this has the advantage that efficient exact search via indexes based on the Burrows-Wheeler Transform (BWT) is possible, that no chimeric sequences are created, and that the results are not influenced by heuristics. However, such an index may report a match in thousands of positions even if these all correspond to the same locus, making downstream analysis unnecessarily expensive. For sufficiently similar sequences (e.g. human chromosomes), a multiple sequence alignment (MSA) can be computed. Since an MSA tends to group similar strings in the same columns, it is likely that a string occurring thousands of times in the pangenome can be described by very…
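
The observation that many occurrences in the pangenome can fall into very few MSA columns can be illustrated with a small Python sketch. This is a hedged toy example, not the TAG array itself: the naive string scan stands in for a BWT-based index, and the alignment, the pattern, and the helper names (`column_maps`, `occurrences`, `match_columns`) are hypothetical.

```python
def column_maps(msa):
    """For each MSA row, list the column index of every non-gap character."""
    return [[col for col, ch in enumerate(row) if ch != "-"] for row in msa]


def occurrences(text, pattern):
    """All start positions of pattern in text (naive scan, stand-in for a BWT index)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def match_columns(msa, pattern):
    """Return the MSA column of the first character of every occurrence."""
    columns = []
    for row, cols in zip(msa, column_maps(msa)):
        sequence = row.replace("-", "")  # the individual genome sequence
        for pos in occurrences(sequence, pattern):
            columns.append(cols[pos])
    return columns


# Toy alignment of three similar sequences ('-' marks a gap).
msa = [
    "ACGT-ACGGA",
    "ACGTTACG-A",
    "ACG--ACGGA",
]

start_columns = match_columns(msa, "ACG")
print(len(start_columns), "occurrences in", len(set(start_columns)), "distinct columns")
```

In this toy alignment the six occurrences of the pattern start in only two columns, so a downstream tool would need to examine two loci rather than six positions.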

Taxonomy

Topics: Genome Rearrangement Algorithms · Algorithms and Data Compression · Genomics and Phylogenetic Studies