KMC 2: Fast and resource-frugal $k$-mer counting

Sebastian Deorowicz; Marek Kokot; Szymon Grabowski; Agnieszka Debudaj-Grabysz

arXiv:1407.1507·cs.DS·March 3, 2017


TL;DR

KMC 2 introduces a fast, memory-efficient disk-based method for $k$-mer counting in large genomic datasets, outperforming existing tools in speed while using moderate RAM.

Contribution

The paper presents a disk-based $k$-mer counting algorithm that reduces I/O by replacing minimizers with signatures (a carefully selected subset of all minimizers) and using $(k, x)$-mers, and that gains speed through parallel processing.

Findings

01. On large datasets, at least twice as fast as the strongest competitors (Jellyfish 2, KMC 1).

02. Counts 28-mers in a human dataset in about 20 minutes on a standard PC.

03. Uses about 12 GB of RAM or less, making it resource-frugal.

Abstract

Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known as $k$-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment, and repeat detection. The tremendous amounts of NGS data require fast algorithms for $k$-mer counting, preferably using moderate amounts of memory.

Results: We present a novel method for $k$-mer counting that, on large datasets, is at least twice as fast as the strongest competitors (Jellyfish 2, KMC 1) while using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using $(k, x)$-mers significantly reduces the I/O, and a highly…
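To make the terminology concrete, here is a minimal Python sketch of $k$-mer counting and of minimizer-based binning, the idea the abstract attributes to MSPKmerCounter. It is an illustration under simplifying assumptions, not the KMC 2 implementation: the function names (`count_kmers`, `minimizer`, `count_kmers_binned`) and the parameter `m` (minimizer length) are hypothetical, the bins live in memory rather than on disk, reverse complements are ignored, and plain lexicographic minimizers stand in for the paper's signatures and $(k, x)$-mers, which are not modeled here.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Histogram of every k-symbol substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimizer(kmer, m):
    """Lexicographically smallest m-symbol substring of a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def count_kmers_binned(reads, k, m):
    """Two-phase counting: distribute k-mers into bins, then count per bin.

    Identical k-mers share a minimizer, so they always land in the same
    bin and each bin can be counted independently (here in memory; a
    disk-based tool would write each bin to its own file).
    """
    bins = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            bins[minimizer(kmer, m)].append(kmer)

    total = Counter()
    for bin_kmers in bins.values():
        total.update(Counter(bin_kmers))
    return total


reads = ["ACGTACGT", "CGTACG"]
counts = count_kmers(reads, k=4)
binned = count_kmers_binned(reads, k=4, m=2)
```

Both functions return the same histogram. The binned variant only changes how the work is partitioned, which is what allows a disk-based counter to process one bin at a time within a bounded amount of RAM.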
