Confidence intervals for maximum unseen probabilities, with application to sequential sampling design
Alessandro Colombi, Mario Beraha, Amichai Painsky, Stefano Favaro

TL;DR
This paper develops nonasymptotic confidence bounds for the maximum unseen probability in discovery problems, enabling effective sequential sampling decisions with finite-sample guarantees, applicable to both finite and infinite category sets.
Contribution
It introduces the first distribution-free, nonasymptotic upper confidence bounds for the maximum unseen probability in both the bounded and unbounded alphabet regimes, shows that the data-dependent bounds are near-optimal, and applies them to sequential sampling design.
Findings
The proposed data-dependent bounds are near-optimal in both regimes.
In the unbounded regime, no nontrivial data-independent bound can be uniformly valid.
The sequential stopping rules are robust to contamination.
Abstract
Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence–absence across sampling units. Our inferential target is the maximum unseen probability, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and…
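To make the setup concrete, here is a minimal Python sketch of the incidence model and the target quantity. It is an illustration only, not the paper's procedure: the function names (`simulate_incidence`, `max_unseen_probability`, `union_bound_ucb`) are hypothetical, and the bound shown is the textbook union bound for a bounded alphabet of known size K, assuming independent sampling units. A category with prevalence above eps is missed in all n units with probability at most (1 - eps)^n, so the maximum unseen probability exceeds eps with probability at most K(1 - eps)^n; setting this equal to alpha and solving for eps gives a data-independent upper confidence bound.

```python
import random


def simulate_incidence(p, n, rng):
    """Draw an n x K presence-absence matrix under the Bernoulli product model.

    Entry [i][j] is True when category j is present in sampling unit i,
    which happens independently with probability p[j].
    """
    return [[rng.random() < pj for pj in p] for _ in range(n)]


def max_unseen_probability(p, incidence):
    """Largest prevalence among categories absent from every sampling unit.

    Returns 0.0 when every category has been observed at least once.
    """
    seen = [any(row[j] for row in incidence) for j in range(len(p))]
    return max((pj for pj, s in zip(p, seen) if not s), default=0.0)


def union_bound_ucb(n, K, alpha):
    """Data-independent upper confidence bound from a union bound (illustrative).

    Solves K * (1 - eps)**n = alpha for eps. This is NOT the paper's
    data-dependent bound; it only illustrates the worst-case idea.
    """
    return 1.0 - (alpha / K) ** (1.0 / n)


rng = random.Random(0)
p = [0.5, 0.2, 0.05, 0.01, 0.001]      # hypothetical prevalences
X = simulate_incidence(p, n=50, rng=rng)
target = max_unseen_probability(p, X)   # unknown in practice; computable here
bound = union_bound_ucb(n=50, K=len(p), alpha=0.05)
print(target, bound)
```

For fixed n and alpha the bound tends to 1 as K grows, which gives some intuition for why a data-independent bound of this kind becomes trivial when the alphabet is unbounded.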
Taxonomy
Topics: Data Quality and Management · Machine Learning and Algorithms · Bayesian Modeling and Causal Inference
