On the number of $k$-mers admitting a given lexicographical minimizer
Florian Ingels, Camille Marchet, Mikaël Salson

TL;DR
This paper investigates the theoretical distribution of $k$-mers sharing a lexicographical minimizer, providing algorithms to compute worst-case bucket sizes and analyzing their practical implications in bioinformatics datasets.
Contribution
It introduces the first theoretical analysis of lexicographical minimizer bucket sizes and provides algorithms for their computation and approximation.
Findings
Worst-case bucket size can be computed in $O(km)$ space and $O(km^2)$ time.
Practical bucket sizes on genomic data closely match theoretical expectations.
Two conjectures are proposed to improve approximation accuracy.
Abstract
The minimizer of a word of size $k$ (a $k$-mer) is defined as its smallest substring of size $m$ (with $m \leq k$), according to some ordering on $m$-mers. Minimizers have been used in bioinformatics, notably, to partition sequencing datasets, binning together $k$-mers that share the same minimizer. It is folklore that using the lexicographical order leads to very unbalanced partitions, resulting in an abundant literature devoted to devising alternative orders for achieving better balanced partitions. To the best of our knowledge, the unbalancedness of lexicographical-based minimizer partitions has never been investigated from a theoretical point of view. In this article, we aim to fill this gap and determine, for a given minimizer, how many $k$-mers would admit the chosen minimizer, i.e. what would be the size of the bucket associated to the chosen minimizer in the worst case,…
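To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm: it simply enumerates every $k$-mer, so it is only usable for very small $k$. The DNA alphabet `ACGT` (ordered A < C < G < T) and the helper names `lex_minimizer` and `bucket_sizes` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA alphabet, ordered A < C < G < T


def lex_minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest substring of length m of kmer."""
    if not 1 <= m <= len(kmer):
        raise ValueError("m must satisfy 1 <= m <= len(kmer)")
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bucket_sizes(k: int, m: int, alphabet: str = ALPHABET) -> Counter:
    """Count, for each m-mer, how many k-mers admit it as minimizer.

    Exhaustive enumeration of all len(alphabet)**k words: exponential in k.
    """
    counts = Counter()
    for letters in product(alphabet, repeat=k):
        counts[lex_minimizer("".join(letters), m)] += 1
    return counts


if __name__ == "__main__":
    sizes = bucket_sizes(k=3, m=2)
    # Print buckets from largest to smallest to see the imbalance.
    for minimizer, size in sizes.most_common():
        print(minimizer, size)
```

Even at this toy scale the imbalance is visible: with $k = 3$ and $m = 2$, the bucket of `AA` holds 7 of the 64 possible $k$-mers, while the bucket of `TT` holds only `TTT`.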
Taxonomy
Topics: Advanced Mathematical Identities · Analytic Number Theory Research · Coding theory and cryptography
