copMEM: Finding maximal exact matches via sampling both genomes

Szymon Grabowski; Wojciech Bieniecki

arXiv:1805.08816·cs.DS·May 24, 2018

TL;DR

copMEM is an algorithm that finds all maximal exact matches (MEMs) of a given minimum length between large genomes by sparsely sampling both genomes with coprime steps. A single-threaded run finds all MEMs of length at least 100 between the human and mouse genomes in under 2 minutes, using less than 10 GB of RAM.

Contribution

It introduces copMEM, a sampling-based method for genome comparison that computes MEMs by sparsely sampling both input genomes, with the two sampling steps chosen to be coprime.

Findings

01  Finds all MEMs of length ≥100 between the human and mouse genomes in under 2 minutes.

02  Uses less than 10 GB of RAM for that computation.

03  Achieves this with a single-threaded implementation.

Abstract

Genome-to-genome comparisons require designating anchor points, which are given by Maximal Exact Matches (MEMs) between their sequences. For large genomes this is a challenging problem, and the performance of existing solutions, even in parallel regimes, is not quite satisfactory. We present a new algorithm, copMEM, that sparsely samples both input genomes, with the sampling steps being coprime. Despite being a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using less than 10 GB of RAM.
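To make the coprime-sampling idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `find_mems`, the parameter names (`min_len`, `k`, `k1`, `k2`), and the dictionary-based index are all hypothetical choices made for illustration. The sketch also assumes the bound `k1 * k2 <= min_len - k + 1`, which is not stated in the abstract. It follows from the Chinese remainder theorem: when the steps are coprime, every window of `k1 * k2` consecutive offsets inside a match contains one offset where a sampled reference k-mer lines up with a sampled query k-mer.

```python
from collections import defaultdict
from math import gcd


def find_mems(ref, qry, min_len, k, k1, k2):
    """Toy MEM finder (illustrative only).

    Indexes every k1-th k-mer of `ref`, probes every k2-th k-mer of `qry`,
    and extends each seed hit to a maximal match.
    Returns a sorted list of (ref_start, qry_start, length) tuples.
    """
    if gcd(k1, k2) != 1:
        raise ValueError("sampling steps must be coprime")
    if k1 * k2 > min_len - k + 1:
        # Assumed bound: guarantees every match of length >= min_len
        # contains at least one seed sampled in both sequences.
        raise ValueError("need k1 * k2 <= min_len - k + 1")

    # Sparse index of the reference: only positions 0, k1, 2*k1, ...
    index = defaultdict(list)
    for i in range(0, len(ref) - k + 1, k1):
        index[ref[i:i + k]].append(i)

    mems = set()
    # Sparse probing of the query: only positions 0, k2, 2*k2, ...
    for j in range(0, len(qry) - k + 1, k2):
        for i in index.get(qry[j:j + k], ()):
            # Extend left while the preceding characters agree.
            left = 0
            while (i - left > 0 and j - left > 0
                   and ref[i - left - 1] == qry[j - left - 1]):
                left += 1
            # Extend right, starting just past the k-mer seed.
            right = k
            while (i + right < len(ref) and j + right < len(qry)
                   and ref[i + right] == qry[j + right]):
                right += 1
            if left + right >= min_len:
                # A set removes duplicates found through different seeds.
                mems.add((i - left, j - left, left + right))
    return sorted(mems)
```

For example, `find_mems("TTACGTACGTAAGG", "CCACGTACGTAACC", min_len=8, k=3, k1=2, k2=3)` indexes only 6 of the 12 reference k-mers and probes only 4 of the 12 query k-mers, yet still reports the single shared 10-character match. A real tool would hash k-mers into compact arrays rather than a Python dictionary; the dictionary is used here only to keep the sketch short.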

Code & Models

Repositories

wbieniec/copmem (official)
