
TL;DR
This paper revisits the unseen species problem, proposing new estimators and prediction intervals for the small, intermediate, and large sample-size regimes, with theoretical guarantees and improved empirical performance.
Contribution
It introduces a new estimator for intermediate sample sizes, constructs principled prediction intervals, and extends guarantees to incidence data without independence assumptions.
Findings
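For small m, the Good-Toulmin estimator is the unique estimator that both respects the symmetries of the problem and has a non-trivial rate.
For intermediate m, a new estimator substantially improves the worst-case MSE compared to competing methods.
For large m, under an assumed power-law tail, a simple estimator achieves the same rate as a recent sophisticated method and performs better empirically.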
Abstract
Given i.i.d. samples from an unknown discrete distribution over an unknown set, the unseen species problem is to predict how many new outcomes would be observed in additional samples. For small m, we show that the Good-Toulmin estimator is the unique estimator that both respects the symmetries of the problem and has a non-trivial rate. We resolve the open problem of constructing principled prediction intervals for it. For intermediate m, we propose a new estimator that has a vastly improved worst-case MSE compared to competing methods, and we expect that our method can be applied to other species sampling problems. For large m, we follow previous authors in assuming a power-law tail and show that a simple estimator achieves the same rate as, and better empirical performance than, a recent sophisticated method. Moreover, we give pre-asymptotic guarantees. We extend the rate…
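For readers unfamiliar with the Good-Toulmin estimator named in the abstract, here is a minimal Python sketch of its classical textbook form, not the paper's own code. It assumes the prediction is for t times as many additional samples as were already observed, and computes the alternating sum -Σ (-t)^i Φ_i, where Φ_i is the number of species seen exactly i times. The function name `good_toulmin`, the parameter `t`, and the toy data are hypothetical, and the paper's notation may differ.

```python
from collections import Counter


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of the number of new species
    expected in t * len(samples) further draws (illustrative sketch)."""
    # species -> number of times it was observed
    counts = Counter(samples)
    # i -> number of species observed exactly i times (the "prevalences")
    prevalences = Counter(counts.values())
    # Alternating power series in t; known to be unstable for t > 1
    return -sum((-t) ** i * phi_i for i, phi_i in prevalences.items())


# Toy data (made up for illustration): two singletons, one doubleton, one tripleton
samples = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_toulmin(samples, t=0.5)
```

The series is only well behaved when t is at most 1, which is consistent with the abstract treating the small, intermediate, and large regimes with different estimators.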
