Extrapolation of Urn Models via Poissonization: Accurate Measurements of the Microbial Unknown
Manuel Lladser, Raúl Gouet, Jens Reeder

TL;DR
This paper introduces a novel Poissonization-based method for accurately predicting the fraction of unobserved species in microbial communities, improving understanding of microbial diversity beyond traditional lower-bound estimates.
Contribution
It presents the Embedding algorithm that provides conditionally unbiased predictions and exact prediction intervals for unseen species fractions using urn models and Poissonization.
Findings
The method yields highly accurate predictions on subsamples.
Predictions are robust across different datasets, including human microbiota.
The approach extends to other domains like RNA solutions and security surveillance.
Abstract
The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower bounds on the alpha-diversity, i.e., the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length in…
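To make the quantity in the abstract concrete, here is a minimal sketch of the urn setup: balls are drawn with replacement from an urn, and the target is the fraction of the urn whose colors never appear in the sample. This is not the paper's Embedding algorithm. For comparison it uses the classical Good-Turing estimate (singletons divided by sample size), which is swapped in purely for illustration. The urn composition, the function names, and the sample size are all hypothetical.

```python
import random
from collections import Counter


def draw_with_replacement(urn, n, rng):
    """Draw n balls with replacement; urn maps color -> number of balls."""
    colors = list(urn)
    weights = [urn[c] for c in colors]
    return rng.choices(colors, weights=weights, k=n)


def unseen_fraction(urn, sample):
    """True fraction of the urn made up of colors absent from the sample.

    Computable only in simulation, where the urn composition is known.
    """
    seen = set(sample)
    total = sum(urn.values())
    unseen = sum(count for color, count in urn.items() if color not in seen)
    return unseen / total


def good_turing_unseen(sample):
    """Good-Turing estimate of the unseen fraction: singletons / sample size.

    A classical estimator used here for illustration only; it is NOT the
    Embedding algorithm described in the paper.
    """
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Hypothetical urn: three dominant colors plus fifty rare ones.
urn = {"A": 500, "B": 300, "C": 150}
urn.update({f"rare{i}": 1 for i in range(50)})

rng = random.Random(0)  # fixed seed so the run is reproducible
sample = draw_with_replacement(urn, 200, rng)

print("true unseen fraction:", unseen_fraction(urn, sample))
print("Good-Turing estimate:", good_turing_unseen(sample))
```

In a real study the urn composition is unknown, so only the estimate is available. The simulation exposes the true value so the two can be compared.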
