Informed and Automated k-Mer Size Selection for Genome Assembly
Rayan Chikhi, Paul Medvedev

TL;DR
This paper introduces KmerGenie, a fast tool that automatically estimates the best k-mer size for de Bruijn graph genome assembly. It builds approximate k-mer abundance histograms by sampling, which the authors report is several orders of magnitude faster than traditional methods.
Contribution
The authors present a sampling method that rapidly constructs approximate k-mer abundance histograms, and a fast heuristic that uses the histograms generated for candidate k values to estimate the best k for assembly.
Findings
Across diverse sequencing datasets, KmerGenie's choice of k leads to some of the best assemblies.
The sampling method constructs approximate abundance histograms several orders of magnitude faster than traditional approaches.
KmerGenie addresses a lack of tools that automatically estimate the best k or quickly generate k-mer abundance histograms for an informed choice.
Abstract
Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. We develop a fast and accurate sampling method that constructs approximate abundance histograms with a several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Our tool KmerGenie is freely available at:…
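The two steps described in the abstract (an approximate abundance histogram built by sampling, then a choice of k based on the histograms) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the abstract does not spell out the sampling scheme, so the sketch assumes hash-based sampling that keeps roughly one in `eps` distinct k-mers and scales the histogram back up. The scoring in `pick_k` (counting distinct k-mers seen at least `min_abundance` times) is a simplified stand-in, not the heuristic used by KmerGenie. The function names, `eps`, and `min_abundance` are hypothetical.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def is_sampled(kmer, eps):
    """Decide deterministically whether a k-mer belongs to the sample.

    Assumption: a k-mer is kept when its hash is divisible by eps, so
    about 1/eps of distinct k-mers are kept and every occurrence of the
    same k-mer gets the same decision.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % eps == 0


def sampled_histogram(reads, k, eps):
    """Approximate abundance histogram: abundance -> number of distinct k-mers.

    Only sampled k-mers are counted; the number of distinct k-mers at each
    abundance is then multiplied by eps to estimate the full histogram.
    """
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if is_sampled(km, eps):
                counts[km] += 1
    hist = Counter(counts.values())
    return {abundance: n * eps for abundance, n in hist.items()}


def pick_k(reads, candidate_ks, eps, min_abundance=2):
    """Score each candidate k and return (best_k, scores).

    Simplified stand-in score (an assumption, not the paper's model):
    the estimated number of distinct k-mers with abundance >= min_abundance,
    treating low-abundance k-mers as likely sequencing errors.
    """
    scores = {}
    for k in candidate_ks:
        hist = sampled_histogram(reads, k, eps)
        scores[k] = sum(n for a, n in hist.items() if a >= min_abundance)
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTACGT"]
    # eps=1 keeps every k-mer, so the histogram is exact on this toy input.
    print(sampled_histogram(toy_reads, 4, 1))
    print(pick_k(toy_reads, [3, 4], 1))
```

With `eps=1` the sample contains every k-mer and the histogram is exact; larger values of `eps` trade accuracy for less counting work, since only the sampled k-mers are kept in memory.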
Taxonomy
Topics: Genomics and Phylogenetic Studies · Chromosomal and Genetic Variations · Evolution and Genetic Dynamics
