Estimating the unseen from multiple populations

Aditi Raghunathan; Greg Valiant; James Zou

arXiv:1707.03854·cs.LG·July 14, 2017

TL;DR

This paper extends unseen-element estimation to multiple populations. It gives an optimal estimator whose accuracy does not depend on the number of populations, and it validates the approach on human genome data.

Contribution

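The paper introduces an optimal estimator for the total number of new elements expected among further samples across multiple populations, together with an efficient optimization algorithm for the more general problem of estimating multi-population frequency distributions.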

Findings

01 The estimator's accuracy is independent of the number of populations.

02 The methods and theory are validated through extensive experiments.

03 The methods are applied to human genome data.

Abstract

Given samples from a distribution, how many new elements should we expect to find if we continue sampling this distribution? This is an important and actively studied problem, with many applications ranging from unseen species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population $j$ has an unknown distribution $D_{j}$ from which we observe $n_{j}$ samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator's accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real…
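To make the extrapolation question in the abstract concrete, here is a minimal sketch of the classical single-population Good-Toulmin estimator. It is not the multi-population estimator derived in the paper; it only illustrates the baseline problem that the paper generalizes. The function name `good_toulmin` and the parameter `t` (the ratio of new samples to observed samples) are illustrative choices, and the sketch restricts `t` to at most 1, where the alternating series is well behaved.

```python
from collections import Counter


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of the number of new distinct
    elements expected in t * n further samples, given n observed samples.

    Illustrative single-population baseline only; this is NOT the
    multi-population estimator from the paper.
    """
    if not 0 <= t <= 1:
        # Assumption of this sketch: only extrapolate up to doubling the sample.
        raise ValueError("this sketch only supports 0 <= t <= 1")

    # How many times each distinct element was observed.
    counts = Counter(samples)

    # prevalence[i] = number of distinct elements observed exactly i times.
    prevalence = Counter(counts.values())

    # U(t) = -sum_{i >= 1} (-t)^i * prevalence[i]
    return -sum(((-t) ** i) * phi for i, phi in prevalence.items())


if __name__ == "__main__":
    observed = ["a", "a", "b", "c", "d", "d", "d"]
    # Estimated number of new distinct elements if we draw 7 more samples.
    print(good_toulmin(observed, 1.0))
```

Elements seen exactly once contribute positively, elements seen twice contribute negatively, and so on, so the estimate is driven mostly by the rare elements in the sample.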
