The Adaptive Sampling Revisited
Matthew Drescher, Guy Louchard, Yvik Swan

TL;DR
This paper revisits the adaptive sampling algorithm for estimating the number of distinct keys in large data collections, providing new theoretical insights, distribution analyses, and extensions to colored keys and unknown parameters.
Contribution
It offers simplified derivations of key distributions, new moment results, and methods to estimate proportions and variances for colored keys and multiplicities.
Findings
The known distribution of W is rederived in a simpler way
New results on the moments of D and W are provided
Estimation methods are given for colored keys, multiplicities, and unknown parameters
Abstract
The problem of estimating the number $n$ of distinct keys of a large collection of data is well known in computer science. A classical algorithm is adaptive sampling (AS). $n$ can be estimated by $R 2^D$, where $R$ is the final bucket (cache) size and $D$ is the final depth at the end of the process. Several new interesting questions can be asked about AS (some of them were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $W = \log(R 2^D / n)$ is known; we rederive this distribution in a simpler way. We provide new results on the moments of $D$ and $W$. We also analyze the distribution of the final cache size $R$. We consider colored keys: assume that among the $n$ distinct keys, $n_C$ have color $C$. We show how to estimate $p = n_C / n$. We also study colored keys with some multiplicity given by some distribution function. We want to estimate mean an…
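To make the estimator in the abstract concrete, here is a minimal Python sketch of adaptive sampling under stated assumptions. It is not the authors' code. The names `hash_bits` and `adaptive_sampling` are hypothetical, SHA-256 truncated to 64 bits stands in for an ideal uniform hash function, and the cache capacity `b` is an illustrative parameter. The cache keeps only the hashed keys whose low `depth` bits are all zero. Whenever it overflows, the depth grows by one and the cache is filtered again.

```python
import hashlib


def hash_bits(key, nbits=64):
    """Map a key to an nbits-bit integer (assumption: SHA-256 acts as a uniform hash)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[: nbits // 8], "big")


def adaptive_sampling(keys, b=64):
    """Return (estimate, R, D), where estimate = R * 2**D approximates the
    number of distinct keys, R is the final cache size and D the final depth."""
    depth = 0
    cache = set()  # hashed keys whose low `depth` bits are all zero
    for key in keys:
        h = hash_bits(key)
        if h & ((1 << depth) - 1) == 0:
            cache.add(h)  # duplicates hash identically, so the set absorbs them
            # On overflow, increase the depth and keep only the hashes that
            # still qualify; repeat in the rare case the cache is still too big.
            while len(cache) > b:
                depth += 1
                mask = (1 << depth) - 1
                cache = {x for x in cache if x & mask == 0}
    r = len(cache)
    return r * 2 ** depth, r, depth


if __name__ == "__main__":
    stream = [i % 10_000 for i in range(50_000)]  # 10,000 distinct keys, each repeated
    estimate, r, d = adaptive_sampling(stream, b=256)
    print(f"R = {r}, D = {d}, estimate = {estimate}")
```

When the stream holds at most `b` distinct keys, the depth stays at 0 and the count is exact. Otherwise the final cache and depth depend only on the set of distinct keys, not on their order or multiplicity, which is why repeated keys do not bias the estimate.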
Taxonomy
Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Bayesian Methods and Mixture Models
