The weighted words collector
Jérémie Du Boisberranger (PRISM), Danièle Gardy (PRISM), Yann Ponty (LIX, INRIA Saclay - Ile de France)

TL;DR
Motivated by bioinformatics, this paper analyzes the expected number of calls to a weighted random word generator needed to collect every word of a given length. It provides a general asymptotic theorem that is especially useful for classes in which many words share the same probability.
Contribution
It introduces a new asymptotic theorem for the weighted word collector problem that accommodates high multiplicities, and applies it to several classes of languages.
Findings
Asymptotic regimes in Θ(μ(n)·n) and Θ(μ(n)·log n), where μ(n) is the sum of weights over words of length n.
The theorem effectively estimates the expected waiting time in weighted word collection scenarios.
Application to three languages demonstrates the theorem's utility in different asymptotic regimes.
Abstract
Motivated by applications in bioinformatics, we consider the word collector problem, i.e. the expected number of calls to a random weighted generator of words of length n before the full collection is obtained. The originality of this instance of the non-uniform coupon collector lies in the potentially large multiplicity of the words/coupons of a given probability/composition. We obtain a general theorem that gives an asymptotic equivalent for the expected waiting time of a general version of the Coupon Collector. This theorem is especially well-suited for classes of coupons featuring high multiplicities. Its application to a given language essentially requires some knowledge of the number of words of a given composition/probability. We illustrate the application of our theorem, in a step-by-step fashion, on three exemplary languages, revealing asymptotic regimes in…
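To make the quantity in the abstract concrete, here is a minimal Python sketch that computes the exact expected waiting time of a non-uniform coupon collector for a tiny word set. It is not the paper's asymptotic theorem: it uses the classical inclusion-exclusion identity E[T] = Σ over non-empty subsets S of (-1)^(|S|+1) / Σ_{i∈S} p_i, which is exponential in the number of words and only practical for toy sizes. The function names, and the assumption that a word's weight is the product of its letter weights with probability weight/μ(n), are illustrative choices and are not taken from the source.

```python
from fractions import Fraction
from itertools import combinations, product


def word_probabilities(letter_weights, n):
    """Return {word: probability} for all words of length n.

    Assumption (illustrative): the weight of a word is the product of its
    letter weights, and its probability is weight / mu(n), where mu(n) is
    the sum of the weights of all words of length n.
    """
    weights = {}
    for letters in product(sorted(letter_weights), repeat=n):
        w = Fraction(1)
        for a in letters:
            w *= letter_weights[a]
        weights["".join(letters)] = w
    mu_n = sum(weights.values())
    return {word: w / mu_n for word, w in weights.items()}


def expected_collection_time(probabilities):
    """Exact expected number of draws needed to see every coupon.

    Uses inclusion-exclusion over all non-empty subsets of coupons, so the
    cost is 2^N for N coupons: suitable for toy examples only.
    """
    probs = list(probabilities)
    total = Fraction(0)
    for k in range(1, len(probs) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(probs, k):
            total += sign / sum(subset)
    return total


if __name__ == "__main__":
    # Uniform case: 4 equiprobable words, so the result is 4 * H_4 = 25/3.
    uniform = word_probabilities({"a": 1, "b": 1}, 2)
    print(expected_collection_time(uniform.values()))

    # Weighted case: words with the same composition ("ab", "ba") share the
    # same probability, which is the multiplicity the abstract refers to.
    weighted = word_probabilities({"a": 1, "b": 2}, 2)
    print(weighted)
    print(expected_collection_time(weighted.values()))
```

In the weighted case, the four words fall into three compositions with probabilities 1/9, 2/9 (twice) and 4/9, so grouping words by composition is what keeps the counting manageable as n grows.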
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Bayesian Methods and Mixture Models
