The Adaptive Sampling Revisited

Matthew Drescher; Guy Louchard; Yvik Swan

arXiv:1805.08043·cs.DS·May 17, 2019

TL;DR

This paper revisits the adaptive sampling algorithm for estimating the number of distinct keys in large data collections, providing new theoretical insights, distribution analyses, and extensions to colored keys and unknown parameters.

Contribution

It offers simplified derivations of key distributions, new moment results, and methods to estimate proportions and variances for colored keys and multiplicities.

Findings

01 Distribution of W is rederived simply

02 New moments of D and W are provided

03 Estimation methods for colored keys and unknown parameters

Abstract

The problem of estimating the number $n$ of distinct keys of a large collection of $N$ data is well known in computer science. A classical algorithm is adaptive sampling (AS). $n$ can be estimated by $R \cdot 2^{D}$, where $R$ is the final bucket (cache) size and $D$ is the final depth at the end of the process. Several new interesting questions can be asked about AS (some of them were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $W = \log (R 2^{D} / n)$ is known; we rederive this distribution in a simpler way. We provide new results on the moments of $D$ and $W$. We also analyze the distribution of the final cache size $R$. We consider colored keys: assume that among the $n$ distinct keys, $n_{C}$ have color $C$. We show how to estimate $p = \frac{n_{C}}{n}$. We also study colored keys with some multiplicity given by some distribution function. We want to estimate mean an…
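The estimator $R \cdot 2^{D}$ mentioned in the abstract can be illustrated with a minimal sketch of adaptive sampling. This is not the authors' code: the function names `adaptive_sampling` and `_hash64`, the `capacity` parameter, the use of BLAKE2b as the hash function, and the choice to test the low-order bits of the hash are all assumptions made for illustration.

```python
import hashlib


def _hash64(key):
    """Map a key to a 64-bit integer (hypothetical helper; any good hash works)."""
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def adaptive_sampling(keys, capacity=64):
    """Return (estimate, R, D) for the number of distinct keys in `keys`.

    Assumed variant: a key is sampled at depth D when the D low-order bits
    of its hash are all zero, so each hash survives with probability 2^-D.
    """
    depth = 0
    cache = set()  # hashes of the distinct keys currently sampled
    for key in keys:
        h = _hash64(key)
        if h & ((1 << depth) - 1) != 0:
            continue  # key does not belong to the current sample
        cache.add(h)  # duplicates are absorbed by the set
        while len(cache) > capacity:
            # Overflow: go one level deeper and keep only hashes that
            # still pass the stricter test.
            depth += 1
            mask = (1 << depth) - 1
            cache = {x for x in cache if x & mask == 0}
    r = len(cache)
    return r * 2 ** depth, r, depth


if __name__ == "__main__":
    estimate, r, depth = adaptive_sampling(range(100_000), capacity=64)
    print(f"R={r}, D={depth}, estimate={estimate}")
```

In this sketch the final cache content depends only on the set of distinct hashes, not on the order or multiplicity of the input, which is why repeated keys do not bias the estimate. A larger `capacity` trades memory for lower variance.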

Taxonomy

Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Bayesian Methods and Mixture Models