Confidence intervals for maximum unseen probabilities, with application to sequential sampling design

Alessandro Colombi; Mario Beraha; Amichai Painsky; Stefano Favaro

arXiv:2601.20320·stat.ME·January 29, 2026

Open Access

TL;DR

This paper develops nonasymptotic upper confidence bounds for the maximum unseen probability in discovery problems. The bounds carry finite-sample guarantees, apply to both finite and countably infinite category sets, and support sequential decisions about whether further sampling is needed.

Contribution

The paper introduces the first distribution-free, nonasymptotic confidence bounds for the maximum unseen probability in both bounded and unbounded alphabet regimes, shows that the data-dependent bounds are near-optimal, and applies them to sequential sampling design.

Findings

01

Proposed data-dependent bounds are near-optimal in both regimes.

02

Established limits of data-independent bounds in unbounded settings.

03

Demonstrated robustness of sequential stopping rules to contamination.

Abstract

Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence-absence across sampling units. Our inferential target is the maximum unseen probability, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and…
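
To make the setup in the abstract concrete, here is a minimal sketch in Python. It simulates incidence data under a Bernoulli product model, computes the maximum unseen probability, and evaluates a simple data-independent upper bound for the bounded-alphabet case. The bound is an elementary union-bound argument used here only for illustration and is not the procedure proposed in the paper: a category with prevalence p is unseen after n units with probability (1 - p)^n, so with K categories the probability that some unseen category has prevalence above t is at most K(1 - t)^n, and solving K(1 - t)^n = alpha gives t. The function names, the prevalence vector, and the values of n, K and alpha are all hypothetical.

```python
import numpy as np


def max_unseen_probability(incidence, probs):
    """Largest prevalence among categories absent from every sampling unit.

    incidence: boolean array of shape (n_units, n_categories).
    probs: true prevalences, known here only because the data are simulated.
    Returns 0.0 when every category has been observed at least once.
    """
    seen = incidence.any(axis=0)
    unseen_probs = probs[~seen]
    return float(unseen_probs.max()) if unseen_probs.size else 0.0


def union_bound_ucb(n_units, n_categories, alpha):
    """Illustrative data-independent bound t solving K * (1 - t)^n = alpha.

    By the union bound, P(max unseen probability > t) <= alpha.
    This is an assumption-labelled sketch, not the paper's bound.
    """
    return 1.0 - (alpha / n_categories) ** (1.0 / n_units)


# Hypothetical example: K = 50 categories, n = 100 sampling units.
rng = np.random.default_rng(0)
probs = rng.uniform(0.0, 0.2, size=50)
n_units = 100

# Bernoulli product model: each unit records presence or absence per category.
incidence = rng.random((n_units, probs.size)) < probs

target = max_unseen_probability(incidence, probs)
bound = union_bound_ucb(n_units, probs.size, alpha=0.05)
print(f"maximum unseen probability: {target:.4f}")
print(f"union-bound upper limit:    {bound:.4f}")
```

Because this bound depends on the data only through n and K, it grows towards 1 as K increases for fixed n, which gives some intuition for why a data-independent procedure cannot remain nontrivial once the alphabet is unbounded.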

Taxonomy

Topics: Data Quality and Management · Machine Learning and Algorithms · Bayesian Modeling and Causal Inference