These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown

TL;DR
The paper introduces khmer, a software package for memory-efficient online k-mer counting built on a Count-Min Sketch. It enables fast, scalable analysis of sequencing data at the cost of a controlled overcount, and it outperforms existing tools in speed and memory usage.
Contribution
It presents a k-mer counting approach based on a probabilistic data structure, the Count-Min Sketch. Compared with exact methods built on hash tables, suffix arrays, or tries, it uses less memory on sparse data sets and supports online updating and retrieval of counts.
Findings
khmer is faster and more memory-efficient than existing tools.
The Count-Min Sketch introduces a systematic overcount of k-mers.
khmer effectively supports error profiling and read normalization.
Abstract
K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers,…
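To make the data structure in the abstract concrete, here is a minimal Python sketch of online k-mer counting with a Count-Min Sketch. It is an illustration only, not khmer's implementation: the class name `CountMinSketch`, the helpers `kmers` and `count_kmers`, the `width` and `depth` parameters, and the choice of salted BLAKE2b hashing are all assumptions made for this example. It also skips details a real tool would need, such as treating a k-mer and its reverse complement as the same key.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: `depth` counter rows of `width` cells."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One salted hash per row, so each row maps the k-mer independently.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        # Online update: increment one cell in every row.
        for row, col in self._cells(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a cell, so the minimum across rows
        # is the tightest estimate and is never below the true count.
        return min(self.tables[row][col] for row, col in self._cells(kmer))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def count_kmers(reads, k, width=1000, depth=4):
    """Stream reads into a sketch; counts can be queried at any point."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


if __name__ == "__main__":
    sketch = count_kmers(["ACGTACGT"], k=4)
    print(sketch.count("ACGT"))  # at least 2, the true count
```

Two properties from the abstract show up directly. Only counters are stored, so the k-mers themselves cannot be recovered from the sketch. And because colliding k-mers share cells, a reported count can exceed the true count but never fall below it; shrinking `width` makes this overcount more frequent, while growing it costs memory.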
