A Framework for Estimating Stream Expression Cardinalities

Anirban Dasgupta; Kevin Lang; Lee Rhodes; Justin Thaler

arXiv:1510.01455·cs.DS·February 25, 2016

TL;DR

This paper introduces a framework for estimating the number of unique identifiers in streams defined by set expressions over multiple distributed data streams. The framework unifies earlier methods and yields new sampling algorithms.

Contribution

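The paper identifies a broad class of algorithms whose estimators are unbiased and satisfy strong variance bounds, generalizing earlier results. It also describes several novel sampling algorithms in this class that achieve a new tradeoff between accuracy, space usage, update speed, and applicability.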

Findings

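1. The estimators output by any algorithm in the class are perfectly unbiased and satisfy strong variance bounds.

2. New sampling algorithms achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.

3. The analysis unifies and generalizes a variety of earlier results on stream cardinality estimation.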

Abstract

Given $m$ distributed data streams $A_{1}, \dots, A_{m}$, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over $A_{1}, \dots, A_{m}$. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
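
To make the problem statement concrete, here is a minimal Python sketch of one well-known way to estimate the cardinality of a set expression from per-stream summaries, using a k-minimum-values (KMV) style sample. It is an illustration under stated assumptions, not the paper's algorithm or analysis: the function names (`unit_hash`, `kmv_sketch`, `estimate_expression`), the choice of SHA-256 as the hash, and the parameter `k` are all hypothetical choices made for this example. Each stream is summarized by a threshold and the set of hash values below it; the summaries are combined by taking the smallest threshold, and the count of retained hashes that satisfy the set expression is scaled by that threshold.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an identifier to a pseudo-uniform value in [0, 1).

    Assumption for this sketch: the top 53 bits of SHA-256 stand in
    for an idealized uniform hash function.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def kmv_sketch(stream, k):
    """Summarize one stream as (theta, retained_hashes).

    Keeps the k smallest distinct hash values. theta is the (k+1)-th
    smallest hash, or 1.0 when the stream has at most k unique items,
    so every hash below theta is retained.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hashes = sorted({unit_hash(x) for x in stream})
    if len(hashes) <= k:
        return 1.0, set(hashes)
    return hashes[k], set(hashes[:k])


def estimate_expression(sketches, membership):
    """Estimate the number of unique items satisfying a set expression.

    sketches:   list of (theta, retained_hashes), one per stream.
    membership: function taking a list of booleans (is the item in
                stream i?) and returning True if the item belongs to
                the set expression.
    """
    theta = min(t for t, _ in sketches)
    # Hashes below the common threshold form a uniform sample of the union.
    sample = {h for _, s in sketches for h in s if h < theta}
    hits = sum(1 for h in sample if membership([h in s for _, s in sketches]))
    return hits / theta


if __name__ == "__main__":
    a = [f"user{i}" for i in range(0, 6000)]
    b = [f"user{i}" for i in range(4000, 10000)]
    sketches = [kmv_sketch(a, 512), kmv_sketch(b, 512)]
    print("union        ", estimate_expression(sketches, any))
    print("intersection ", estimate_expression(sketches, all))
    print("difference   ", estimate_expression(sketches, lambda m: m[0] and not m[1]))
```

In this sketch the summaries can be built independently on each stream and merged afterwards, which is what makes the approach usable in a distributed setting. When a stream has at most `k` unique identifiers the threshold stays at 1.0 and the estimate is an exact count; otherwise accuracy depends on `k` and on how large the expression is relative to the union.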
