The Unseen Species Problem Revisited

Edward Eriksson

arXiv:2602.08769 · math.ST · May 8, 2026

TL;DR

This paper revisits the unseen species problem, proposing estimators and prediction intervals for small, intermediate, and large numbers m of additional samples, with theoretical guarantees and improved empirical performance.

Contribution

It introduces a new estimator for intermediate sample sizes, constructs principled prediction intervals, and extends guarantees to incidence data without independence assumptions.

Findings

01

For small m, the Good-Toulmin estimator is the unique estimator that both respects the symmetries of the problem and has a non-trivial rate.

02

For intermediate m, a new estimator has a vastly improved worst-case MSE compared with competing methods.

03

For large m, under a power-law tail assumption, a simple estimator achieves the same rate as a recent sophisticated method and better empirical performance.

Abstract

Given $n$ i.i.d. samples from an unknown discrete distribution over an unknown set, the unseen species problem is to predict how many new outcomes would be observed in $m$ additional samples. For small $m$ we show that the Good-Toulmin estimator is the unique estimator which both respects the symmetries of the problem and has a non-trivial rate. We resolve the open problem of constructing principled prediction intervals for it. For intermediate $m$ we propose a new estimator which has a vastly improved worst-case MSE compared to competing methods, and we expect that our method can be applied to other species sampling problems. For large $m$ we follow previous authors in assuming a power law tail and show that a simple estimator achieves the same rate and better empirical performance than a recent sophisticated method. Moreover, we give pre-asymptotic guarantees. We extend the rate…
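For orientation, the classical Good-Toulmin estimator mentioned in the abstract can be sketched in a few lines. With $t = m/n$ and $\Phi_i$ the number of distinct outcomes observed exactly $i$ times among the $n$ samples, the textbook form is $-\sum_{i \ge 1} (-t)^i \Phi_i$. The sketch below implements only that textbook form; it is not the paper's new intermediate-$m$ or large-$m$ estimator, nor its prediction intervals, and the function name `good_toulmin` is a hypothetical choice made for illustration.

```python
from collections import Counter


def good_toulmin(samples, m):
    """Textbook Good-Toulmin estimate of the number of new outcomes
    expected in m additional samples (illustrative sketch only)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one observed sample")
    t = m / n  # extrapolation ratio

    # Count how often each outcome was seen, then how many outcomes
    # were seen exactly i times (the prevalences Phi_i).
    outcome_counts = Counter(samples)
    prevalences = Counter(outcome_counts.values())

    # Alternating series: -sum_{i >= 1} (-t)^i * Phi_i
    return -sum((-t) ** i * phi for i, phi in prevalences.items())


if __name__ == "__main__":
    observed = ["a", "a", "b", "c", "c", "c", "d"]
    print(good_toulmin(observed, m=7))
```

Because the series alternates with terms growing like $t^i$, the estimate is generally considered reliable only when $m$ is not much larger than $n$, which is consistent with the abstract treating the small-$m$ regime separately.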
