
TL;DR
This paper addresses the problem of finding minimizers of minimum density, where minimizers are sampling schemes used in computational biology. It gives algorithms for exact solutions, analyzes minimizer properties via automata theory, and computes optimal minimizers for several parameter sets.
Contribution
It introduces an efficient algorithm for finding minimum density minimizers, analyzes their properties using regular languages, and computes optimal solutions for specific parameter sets.
Findings
Algorithms for exact minimum density minimizers are developed.
Automata-based analysis provides insights into asymptotic minimizer properties.
Computed minimizers for various parameters outperform average density and theoretical bounds.
Abstract
Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size σ, a minimizer is defined by two integers k and w and a linear order ρ on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length w+k−1, its minimal k-mer with respect to ρ. A key characteristic of the minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary string. Minimizers of smaller density are preferred as they produce smaller samples with the same guarantee: each window is represented by a k-mer. The problem of finding a minimizer of minimum density for given input parameters (σ, k, w) has a huge search space of (σ^k)! and is representable by an ILP of size…
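The sliding-window definition and the notion of density can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It is a brute-force illustration that assumes a k-mer length `k`, windows of `w` consecutive k-mers (window length w+k−1), and ties between equal k-mers broken in favor of the leftmost occurrence. The function names `sample_positions` and `exact_density` and the `rank` callback are hypothetical. The density is computed by enumerating every string of length w+k (two consecutive windows) and counting those in which the second window selects a different position than the first, which is feasible only for very small parameters.

```python
from fractions import Fraction
from itertools import product


def sample_positions(s, k, w, rank):
    """Return the sorted start positions of the k-mers chosen in s.

    rank maps a k-mer to a comparable key that encodes the linear order.
    Assumption: ties between equal k-mers go to the leftmost occurrence.
    """
    chosen = set()
    n_kmers = len(s) - k + 1
    for start in range(n_kmers - w + 1):
        # Each window covers w consecutive k-mers starting at `start`.
        best = min(range(start, start + w),
                   key=lambda i: (rank(s[i:i + k]), i))
        chosen.add(best)
    return sorted(chosen)


def exact_density(alphabet, k, w, rank):
    """Brute-force density over all strings of length w + k.

    A string of length w + k holds two consecutive windows. It contributes
    to the density when the second window picks a position that the first
    window did not pick.
    """
    charged = 0
    total = 0
    for letters in product(alphabet, repeat=w + k):
        s = "".join(letters)

        def pick(start):
            return min(range(start, start + w),
                       key=lambda i: (rank(s[i:i + k]), i))

        if pick(0) != pick(1):
            charged += 1
        total += 1
    return Fraction(charged, total)


if __name__ == "__main__":
    # Lexicographic order on k-mers, used here only as an example order.
    lexicographic = lambda kmer: kmer
    print(sample_positions("ACGTTGCAACGT", k=3, w=4, rank=lexicographic))
    print(exact_density("AC", k=2, w=3, rank=lexicographic))
```

The enumeration visits σ^(w+k) strings for a single order, so it only serves to check small cases. Searching over all (σ^k)! orders this way is exactly the blow-up the abstract describes.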
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing
