KMC 2: Fast and resource-frugal $k$-mer counting
Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, Agnieszka Debudaj-Grabysz

TL;DR
KMC 2 introduces a fast, memory-efficient disk-based method for $k$-mer counting in large genomic datasets, outperforming existing tools in speed while using moderate RAM.
Contribution
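The paper presents a disk-based $k$-mer counting algorithm that replaces plain minimizers with signatures (a carefully selected subset of all minimizers) to reduce disk I/O, and relies on a highly parallel architecture for speed, while keeping RAM usage moderate.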
Findings
At least twice as fast as the strongest competitors (Jellyfish 2, KMC 1) on large datasets.
Counts 28-mers in a human dataset in about 20 minutes on a standard PC.
Uses about 12 GB or less RAM, making it resource-frugal.
Abstract
Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of $k$-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for $k$-mer counting, preferably using moderate amounts of memory. Results: We present a novel method for $k$-mer counting, on large datasets at least twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM memory. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using $(k, x)$-mers allows to significantly reduce the I/O, and a highly…
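To make the abstract's terms concrete, here is a minimal Python sketch of what $k$-mer counting computes and of how minimizers can split the work into independent bins. It is an illustration only, not KMC 2's implementation. The function names (`count_kmers`, `minimizer`, `bin_by_minimizer`, `count_via_bins`) and the minimizer length `m` are assumptions made for this example. The sketch uses plain lexicographic minimizers rather than the paper's signatures, keeps the bins in memory rather than on disk, and stores every $k$-mer separately instead of compacting them.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Naive in-memory histogram of every k-symbol substring of the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimizer(kmer, m):
    """Lexicographically smallest m-symbol substring of a k-mer.

    Plain minimizer only; the signature selection rule used by KMC 2
    is NOT reproduced here.
    """
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bin_by_minimizer(reads, k, m):
    """Distribute k-mers into bins keyed by their minimizer.

    Identical k-mers always share a minimizer, so they land in the same
    bin and each bin can be counted independently. In a disk-based tool
    the bins would be files; here they are lists (an assumption made to
    keep the example short).
    """
    bins = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            bins[minimizer(kmer, m)].append(kmer)
    return bins


def count_via_bins(reads, k, m):
    """Count each bin separately and merge the per-bin histograms."""
    total = Counter()
    for kmers in bin_by_minimizer(reads, k, m).values():
        total.update(Counter(kmers))
    return total


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTT"]
    direct = count_kmers(reads, k=4)
    binned = count_via_bins(reads, k=4, m=2)
    print(direct == binned)  # both routes give the same histogram
```

Because every occurrence of a given $k$-mer goes to the same bin, the bins can be processed one at a time or in parallel, which is the property a disk-based counter depends on to bound its memory use.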
