Near-optimal estimation of the unseen under regularly varying tail populations
Stefano Favaro, Zacharie Naulet

TL;DR
This paper introduces a simple, efficient estimator for predicting the number of unseen species in a population with regularly varying tail distributions, achieving near-optimality and consistency over a broad range.
Contribution
It develops the first estimator for unseen species under regularly varying tail assumptions, improving upon non-parametric methods by leveraging semi-parametric tail behavior.
Findings
Estimator is minimax near-optimal up to a log factor
Estimator is consistent up to a theoretically optimal range
Method demonstrated on synthetic and real datasets
Abstract
Given n samples from a population of individuals belonging to different species, what is the number U of hitherto unseen species that would be observed if λn new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent works introducing non-parametric estimators of U that are minimax near-optimal and consistent all the way up to λ ≍ log n. These works do not rely on any assumption on the underlying unknown distribution p of the population, and therefore, while providing a theory in its greatest generality, worst-case distributions may severely hamper the estimation of U in concrete applications. In this paper, we consider the problem of strengthening the non-parametric framework for estimating U. Inspired by the estimation of rare probabilities in extreme value theory, and motivated by the…
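To make the quantity in the abstract concrete, the following minimal sketch simulates the problem setup only; it is not the paper's estimator. It draws an initial sample, draws a further batch that is a multiple of the initial sample size, and counts the species that appear in the new batch but not in the initial one. The function names, the finite truncation of the population, and the choice of weights proportional to j^(-1/alpha) as an example of a regularly varying tail of index alpha are all assumptions made for illustration.

```python
import random


def power_law_weights(num_species, alpha):
    """Unnormalized weights p_j proportional to j^(-1/alpha).

    Illustrative assumption: a truncated power law used as a simple
    example of a tail that is regularly varying with index alpha.
    """
    return [j ** (-1.0 / alpha) for j in range(1, num_species + 1)]


def count_unseen(n, lam, weights, rng):
    """Count species in int(lam * n) new draws that were absent from the first n draws."""
    species = range(len(weights))
    first = rng.choices(species, weights=weights, k=n)
    extra = rng.choices(species, weights=weights, k=int(lam * n))
    seen = set(first)
    # Distinct species observed only in the additional sample
    return len(set(extra) - seen)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
weights = power_law_weights(100_000, alpha=0.5)
u = count_unseen(n=1000, lam=2.0, weights=weights, rng=rng)
print(u)
```

In practice the second batch is not available, which is why the count has to be estimated from the first sample alone; the simulation only shows what the target of estimation is.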
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Statistical Methods and Inference
