Non-Mergeable Sketching for Cardinality Estimation

Seth Pettie; Dingyu Wang; Longhui Yin

arXiv:2008.08739·cs.DS·February 17, 2021

TL;DR

This paper introduces a new non-mergeable sketch for cardinality estimation that combines lower variance with efficiency. It offers theoretical optimality results and practical advantages over existing sketches such as HyperLogLog.

Contribution

The paper presents a simpler analysis of Martingale transforms, proves their optimality among linearizable sketches, and develops a new practical sketch, Curtain, that balances simplicity and efficiency.

Findings

01. The Martingale transform is optimal among linearizable sketches, with an MVP of about 1.63.

02. The Curtain sketch achieves an MVP of about 2.31, balancing simplicity and efficiency.

03. The Curtain sketch outperforms HyperLogLog in empirical variance.
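The findings refer to the Martingale transform without describing it. As a rough illustration only, the sketch below applies the idea, as it is commonly described in the work of Cohen and Ting, to a LogLog-style register array: whenever an item changes the sketch, the running estimate grows by the reciprocal of the probability that a fresh item would have changed it. This is neither the paper's Curtain nor its Fishmonger sketch. The class name, the hash choice, the rank cap, and the requirement that `m` be a power of two are all assumptions made for this example.

```python
import hashlib


class MartingaleLogLog:
    """Illustrative Martingale-style estimator over LogLog registers (hypothetical)."""

    MAX_RANK = 40  # assumed cap on the geometric rank, chosen for this example

    def __init__(self, m=256):
        self.m = m                    # number of registers (assumed a power of two)
        self.reg = [0] * m            # each register stores the largest rank seen
        self.change_prob = 1.0        # (1/m) * sum_j 2^(-reg[j])
        self.estimate = 0.0           # running cardinality estimate

    @staticmethod
    def _hash64(item):
        # 64-bit hash of the item's repr; any well-mixed hash would do here.
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        h = self._hash64(item)
        j = h % self.m                # register index
        rest = h // self.m            # remaining bits drive the rank
        rank = 1                      # 1 + number of trailing zero bits (capped)
        while (rest & 1) == 0 and rank < self.MAX_RANK:
            rest >>= 1
            rank += 1
        if rank > self.reg[j]:
            # The sketch changes: credit 1 / Pr[a fresh item changes the sketch],
            # using the probability from the state *before* this update.
            self.estimate += 1.0 / self.change_prob
            self.change_prob += (2.0 ** -rank - 2.0 ** -self.reg[j]) / self.m
            self.reg[j] = rank
        # Otherwise (including every duplicate) the state and estimate are unchanged.


sketch = MartingaleLogLog(m=256)
for x in range(10_000):
    sketch.add(x)
print(round(sketch.estimate))
```

Because the estimate depends on the order in which the sketch changed, two such sketches cannot be combined into a sketch of the union, which is the sense in which the construction is non-mergeable.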

Abstract

Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). For example, a sketch that occupies $5m$ bits and whose relative variance is $2/m$ (standard error $\sqrt{2/m}$) has an MVP of $10$. Our contributions are as follows. Cohen and Ting independently discovered what we call…
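The MVP arithmetic in the abstract can be checked with a few lines of Python. The helper names below are hypothetical. The second function simply inverts the definition, on the assumption that the relative variance equals the square of the standard error, to give the sketch size implied by a target error.

```python
import math


def memory_variance_product(bits, relative_variance):
    """MVP = (memory in bits) * (relative variance of the estimate)."""
    return bits * relative_variance


def bits_for_standard_error(mvp, standard_error):
    """Bits implied by a given MVP and target relative standard error.

    Assumes relative variance = standard_error ** 2, so bits = MVP / se^2.
    """
    return mvp / standard_error ** 2


m = 1000  # arbitrary size parameter chosen for this example
mvp = memory_variance_product(bits=5 * m, relative_variance=2 / m)
std_err = math.sqrt(2 / m)
print(f"MVP = {mvp:.2f}, standard error = {std_err:.4f}")

# Sketch size implied by an MVP of 1.63 at 1% standard error.
bits = bits_for_standard_error(mvp=1.63, standard_error=0.01)
print(f"{bits:.0f} bits, about {bits / 8 / 1000:.1f} kilobytes")
```

Under these assumptions, an MVP of about 1.63 at a 1% standard error corresponds to roughly 16,300 bits, or about 2 kilobytes.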
