Vigemers: on the number of $k$-mers sharing the same XOR-based minimizer
Florian Ingels, Antoine Limasset, Camille Marchet, Mikaël Salson

TL;DR
This paper analyzes how many $k$-mers share the same XOR-based minimizer, providing a theoretical framework and algorithms to compute the maximum bucket size when minimizers are used to partition $k$-mers from DNA/RNA sequences.
Contribution
It extends theoretical analysis of minimizer partitions to XOR-based hash functions, offering combinatorial equations and efficient algorithms for their computation.
Findings
Derived formulas for maximum bucket size under XOR-based minimizers
Developed dynamic programming algorithms with $O(km^2)$ complexity
Provided insights into partition quality for bioinformatics applications
Abstract
In bioinformatics, minimizers have become an inescapable method for handling $k$-mers (words of fixed size $k$) extracted from DNA or RNA sequencing, whether for sampling, storage, querying or partitioning. According to some fixed order on $m$-mers ($m < k$), the minimizer of a $k$-mer is defined as its smallest $m$-mer -- and acts as its fingerprint. Although minimizers are widely used for partitioning purposes, there is almost no theoretical work on the quality of the resulting partitions. For instance, it has been known for decades that the lexicographic order empirically leads to highly unbalanced partitions that are unusable in practice, but it was not until very recently that this observation was theoretically substantiated. The rejection of the lexicographic order has led the community to resort to (pseudo-)random orders using hash functions. In this work, we extend the theoretical…
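The definition of a minimizer given in the abstract can be illustrated with a small brute-force sketch. It is not the paper's algorithm: it assumes that "XOR-based" means each $m$-mer is packed into a $2m$-bit integer (with the assumed encoding A=0, C=1, G=2, T=3), XORed with a fixed key, and compared as an integer. The names `encode`, `xor_minimizer` and `bucket_sizes` are hypothetical, and the exhaustive enumeration is only practical for very small $k$.

```python
from collections import Counter
from itertools import product

# Assumed 2-bit nucleotide encoding (not specified in the text above).
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word):
    """Pack a DNA word into an integer, two bits per letter."""
    value = 0
    for ch in word:
        value = (value << 2) | ENC[ch]
    return value


def xor_minimizer(kmer, m, key):
    """Return the m-mer of `kmer` that is smallest after XOR with `key`.

    With key = 0 this is the plain lexicographic minimizer.
    XOR with a fixed key is a bijection on m-mers, so ties can only
    occur between identical m-mers and do not affect the returned value.
    """
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(mmers, key=lambda w: encode(w) ^ key)


def bucket_sizes(k, m, key):
    """Count, by exhaustive enumeration, how many k-mers share each minimizer."""
    counts = Counter()
    for letters in product("ACGT", repeat=k):
        counts[xor_minimizer("".join(letters), m, key)] += 1
    return counts


if __name__ == "__main__":
    for key_word in ("AA", "AT"):
        sizes = bucket_sizes(4, 2, encode(key_word))
        print(key_word, sizes.most_common(3))
```

Under these assumptions, with $k = 4$ and $m = 2$, the bucket of `AA` under key 0 holds 40 of the 256 possible $k$-mers, whereas the bucket of `AT` under the key encoding `AT` holds 47. This shows on a toy scale that the size of the largest bucket depends on the chosen key.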
Taxonomy
Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Genome Rearrangement Algorithms
