CREMSA: compressed indexing of (ultra) large multiple sequence alignments
Mikaël Salson, Arthur Boddaert, Awa Bousso Gueye, Laurent Bulteau, Yohan Hernandez--Courbevoie, Camille Marchet, Nan Pan, Sebastian Will, Yann Ponty

TL;DR
CREMSA is a new method for efficiently compressing and querying large multiple sequence alignments, enabling faster analysis of viral genomes.
Contribution
CREMSA introduces a novel column-wise compression approach for MSAs, enabling fast access and improved compression ratios.
Findings
CREMSA compressed a 65 GB SARS-CoV-2 MSA into 22 MB, roughly a 3,000-fold reduction, while keeping access times fast.
A new sorting strategy significantly improves compression ratios with low computational cost.
CREMSA enables efficient covariation analysis on ultra-large MSAs.
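To illustrate why row order matters for column-wise run-length encoding, here is a minimal Python sketch. It counts the runs of identical symbols in each column before and after reordering the rows. The summary above does not describe CREMSA's actual sorting strategy, so the plain lexicographic sort below is only a hypothetical stand-in, and the function names and toy alignment are assumptions made for illustration.

```python
def count_runs(column):
    """Number of maximal runs of identical symbols in one column."""
    if not column:
        return 0
    return 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)


def total_column_runs(rows):
    """Total number of runs over all columns of an alignment.

    `rows` is a list of equal-length strings, one per aligned sequence.
    """
    return sum(count_runs(col) for col in zip(*rows))


# Toy alignment (hypothetical data): two variants that alternate row by row.
rows = ["ACGT", "TCGA", "ACGT", "TCGA", "ACGT", "TCGA"]

unsorted_runs = total_column_runs(rows)
# Lexicographic sort is an assumed stand-in for the paper's sorting strategy.
sorted_runs = total_column_runs(sorted(rows))

print(unsorted_runs, sorted_runs)  # prints: 14 6
```

Grouping similar sequences next to each other lengthens the runs in every variable column, so a run-length encoding of the columns has fewer entries to store.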
Abstract
Recent viral outbreaks motivate the systematic collection of pathogenic genomes in order to accelerate their study and monitor the emergence and spread of variants. Due to their limited length and the temporal proximity of their sequencing, viral genomes are usually organized and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped and mostly homogeneous at the column level, but not at the sequence level because of local variations, which hinders the performance of sequential compression algorithms. To enable efficient handling of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for MSAs), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA, while allowing for an expressive set of accelerated requests to query the alignment without prior…
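The general idea of column-wise run-length encoding with direct access can be sketched in a few lines of Python. This is not the CREMSA implementation: each column is stored as the sorted list of row positions where a new run starts, plus the symbol of each run. The sorted list of run starts is an assumed stand-in for a compressed sparse bitvector, and a binary search plays the role of a rank query on it. All function names and the toy alignment are hypothetical.

```python
from bisect import bisect_right


def encode_column(column):
    """Run-length encode one column as (run_starts, run_symbols)."""
    starts, symbols = [], []
    for i, sym in enumerate(column):
        if not symbols or symbols[-1] != sym:
            starts.append(i)      # a new run begins at row i
            symbols.append(sym)
    return starts, symbols


def encode_msa(rows):
    """Encode every column of an alignment given as equal-length strings."""
    return [encode_column(col) for col in zip(*rows)]


def access(index, row, col):
    """Return the symbol at (row, col) without decompressing the column."""
    starts, symbols = index[col]
    # Binary search for the last run starting at or before `row`.
    return symbols[bisect_right(starts, row) - 1]


def column_counts(index, col, n_rows):
    """Symbol frequencies of one column, computed from run lengths only."""
    starts, symbols = index[col]
    counts = {}
    for k, sym in enumerate(symbols):
        end = starts[k + 1] if k + 1 < len(starts) else n_rows
        counts[sym] = counts.get(sym, 0) + (end - starts[k])
    return counts


# Toy alignment (hypothetical data).
rows = ["ACGT", "ACGT", "ACTT", "ACTT", "GCTT"]
index = encode_msa(rows)

print(access(index, 3, 2))                 # prints: T
print(column_counts(index, 2, len(rows)))  # prints: {'G': 2, 'T': 3}
```

Because a mostly homogeneous column has few runs, both the stored data and the work done per query scale with the number of runs rather than the number of sequences. Per-column statistics such as symbol frequencies, a typical building block of covariation analysis, can then be read off the run lengths directly.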
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · RNA and protein synthesis mechanisms
