TL;DR
This paper introduces mum-phinder, an efficient method for computing Maximal Unique Matches (MUMs) using the r-index, enabling scalable analysis of large pangenomic datasets with significant memory savings.
Contribution
It extends the r-index-based approach to compute MUMs, adding LCP samples to enable candidate MUM detection while maintaining space and time efficiency.
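To make the object being computed concrete, here is a minimal brute-force sketch of what a MUM is, assuming the usual definition: a match between a text and a pattern that cannot be extended in either direction and that occurs exactly once in each string. This is not the paper's r-index and LCP-sample method. It is a quadratic illustration for tiny inputs, and the names `naive_mums` and `count_occurrences` are hypothetical.

```python
def count_occurrences(s, sub):
    """Count occurrences of sub in s, including overlapping ones."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)
    return count


def naive_mums(text, pattern, min_len=1):
    """Return (text_pos, pattern_pos, length) for every maximal unique match.

    Brute force for illustration only; assumes the standard MUM definition,
    not the r-index-based algorithm described in the paper.
    """
    mums = []
    for i in range(len(text)):
        for j in range(len(pattern)):
            if text[i] != pattern[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and text[i - 1] == pattern[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while (i + k < len(text) and j + k < len(pattern)
                   and text[i + k] == pattern[j + k]):
                k += 1
            if k < min_len:
                continue
            sub = text[i:i + k]
            # Keep the match only if it is unique in both strings.
            if count_occurrences(text, sub) == 1 and count_occurrences(pattern, sub) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(naive_mums("abcxyz", "xyzabc"))
```

On real pangenomic inputs this approach is far too slow and memory-hungry, which is the gap a compressed index such as the r-index is meant to close.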
Findings
Up to 8 times smaller memory usage compared to competitors.
Up to 19 times slower than competitors on less repetitive data.
Up to 6.5 times slower than competitors, but significantly more memory-efficient, on highly repetitive data.
Abstract
In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven themselves to be useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of…
