# CREMSA: compressed indexing of (ultra) large multiple sequence alignments

**Authors:** Mikaël Salson, Arthur Boddaert, Awa Bousso Gueye, Laurent Bulteau, Yohan Hernandez-Courbevoie, Camille Marchet, Nan Pan, Sebastian Will, Yann Ponty

PMC · DOI: 10.1093/bioinformatics/btaf211 · 2025-07-15

## TL;DR

CREMSA is a compressed index for very large multiple sequence alignments that answers queries without prior decompression, making analyses such as covariation studies of viral genomes practical.

## Contribution

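CREMSA introduces a column-wise run-length encoding of MSAs built on sparse bitvector representations, which supports fast access without decompression, together with a sequence re-sorting strategy that improves compression ratios.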

## Key findings

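- CREMSA compressed a 65 GB MSA of 1.9M SARS-CoV-2 genomes into 22 MB using less than half a gigabyte of main memory, with access requests on the order of 100 ns.
- A re-sorting strategy for the sequences greatly increases compression ratios at a marginal computational cost, even though finding an optimal order is NP-hard.
- The resulting speedup enables a comprehensive covariation analysis over this ultra-large MSA.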
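
To illustrate why the order of the sequences matters for a column-wise run-length scheme, the following minimal sketch counts the runs of identical symbols per column before and after reordering the rows. The function name `total_runs` and the toy alignment are hypothetical, and the plain lexicographic sort is only a simple stand-in heuristic, not the re-sorting strategy proposed in the paper.

```python
def total_runs(msa):
    """Count runs of identical symbols, summed over all columns.

    `msa` is a list of equal-length strings (one string per sequence).
    Fewer runs means a shorter column-wise run-length encoding.
    """
    if not msa:
        return 0
    runs = 0
    for col in range(len(msa[0])):
        runs += 1  # the first row always opens a run
        for row in range(1, len(msa)):
            if msa[row][col] != msa[row - 1][col]:
                runs += 1  # symbol changed: a new run starts here
    return runs


# Toy alignment in which two variants alternate from row to row.
interleaved = ["AAAA", "CCCC", "AAAA", "CCCC", "AAAT", "CCCC"]

# Assumption: lexicographic order as a cheap way to group similar rows.
reordered = sorted(interleaved)

print(total_runs(interleaved))  # 24
print(total_runs(reordered))    # 9
```

Grouping similar sequences together shortens every column's encoding at once, which is the effect a good ordering aims for.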

## Abstract

Recent viral outbreaks motivate the systematic collection of pathogenic genomes in order to accelerate their study and monitor the emergence and spread of variants. Due to their limited length and the temporal proximity of their sequencing, viral genomes are usually organized and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped and mostly homogeneous at the column level, but not at the sequence level due to local variations, which hinders the performance of sequential compression algorithms.

To enable efficient handling of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for MSAs), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA while supporting an expressive set of accelerated queries on the alignment without prior decompression. Using CREMSA, a 65 GB MSA consisting of 1.9M SARS-CoV-2 genomes could be compressed into 22 MB using less than half a gigabyte of main memory, while executing access requests on the order of 100 ns. Such a speedup enables a comprehensive analysis of covariation over this very large MSA. We further assess the impact of sequence ordering on the compressibility of MSAs and propose a re-sorting strategy that, despite the proven NP-hardness of optimal sorting, greatly increases compression ratios at a marginal computational cost.
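
As a rough illustration of the idea behind a column-wise run-length index, the sketch below encodes each column as a list of run start positions plus one symbol per run, and answers access and count queries directly on that encoding. All names (`encode_column`, `build_index`, `access`, `column_counts`) are hypothetical, and the sorted list of run starts searched with `bisect` is an assumed stand-in for a sparse bitvector with rank support. It is not the CREMSA implementation or its API.

```python
from bisect import bisect_right


def encode_column(column):
    """Run-length encode one column as (run_starts, run_symbols)."""
    starts, symbols = [], []
    for row, ch in enumerate(column):
        if not symbols or symbols[-1] != ch:
            starts.append(row)   # row index where a new run begins
            symbols.append(ch)   # symbol repeated throughout that run
    return starts, symbols


def build_index(msa):
    """Encode every column of an MSA given as equal-length strings."""
    if not msa:
        return []
    width = len(msa[0])
    if any(len(seq) != width for seq in msa):
        raise ValueError("all sequences of an MSA must have the same length")
    return [encode_column([seq[c] for seq in msa]) for c in range(width)]


def access(index, row, col):
    """Return the symbol at (row, col) without decoding the column."""
    starts, symbols = index[col]
    # Binary search for the last run starting at or before `row`;
    # this plays the role of a rank query on a sparse bitvector.
    return symbols[bisect_right(starts, row) - 1]


def column_counts(index, col, n_rows):
    """Count symbols in a column by summing run lengths."""
    starts, symbols = index[col]
    counts = {}
    for i, sym in enumerate(symbols):
        end = starts[i + 1] if i + 1 < len(starts) else n_rows
        counts[sym] = counts.get(sym, 0) + (end - starts[i])
    return counts


msa = ["ACGT", "ACGT", "ACTT", "ACTT", "GCTT"]
index = build_index(msa)

print(index[0])                            # ([0, 4], ['A', 'G'])
print(access(index, 4, 0))                 # G
print(column_counts(index, 2, len(msa)))   # {'G': 2, 'T': 3}
```

Because a mostly homogeneous column collapses into a handful of runs, both the storage and the cost of a count query scale with the number of runs rather than the number of sequences.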

CREMSA is freely accessible at https://gitlab.univ-lille.fr/cremsa/cremsa. The Snakemake workflow for the benchmarks is available at: https://gitlab.univ-lille.fr/cremsa/bench. The data used in the paper is on Zenodo at https://zenodo.org/records/14698859 and https://zenodo.org/records/15100011.

## Full-text entities

- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12261481/full.md

---
Source: https://tomesphere.com/paper/PMC12261481