Non-Mergeable Sketching for Cardinality Estimation
Seth Pettie, Dingyu Wang, Longhui Yin

TL;DR
This paper introduces a new non-mergeable sketch for cardinality estimation that achieves lower variance for a given memory budget, backed by theoretical optimality results and practical advantages over existing sketches such as HyperLogLog.
Contribution
It presents a simpler analysis of the Martingale transform, proves its optimality among linearizable sketches, and develops a new practical sketch called Curtain that balances simplicity and efficiency.
Findings
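The Martingale transform is optimal for non-mergeable sketching; Martingale Fishmonger is optimal among linearizable sketches, with an MVP of about 1.63.
Martingale Curtain has a limiting MVP of about 2.31, balancing simplicity and efficiency.
Martingale Curtain has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.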
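As a rough illustration of the Martingale transform named in the findings above, the following Python sketch applies a Martingale-style estimator to plain LogLog registers. It is not the paper's Curtain or Fishmonger sketch. The class name `MartingaleLogLog`, the choice of BLAKE2b as the hash, and the default of 256 registers are assumptions made for this example only.

```python
import hashlib


class MartingaleLogLog:
    """Illustrative Martingale-style estimator over LogLog registers (hypothetical name)."""

    def __init__(self, num_registers: int = 256):
        # Assumption of this sketch: num_registers is a power of two.
        self.m = num_registers
        self.index_bits = num_registers.bit_length() - 1
        self.rank_bits = 64 - self.index_bits
        self.registers = [0] * num_registers
        self.estimate = 0.0
        # Probability that a previously unseen element changes some register.
        # Equals (1/m) * sum(2 ** -r for r in registers); starts at 1.
        self.change_prob = 1.0

    @staticmethod
    def _hash64(item: str) -> int:
        # Hash choice is an assumption; any well-mixed 64-bit hash would do.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        j = h >> self.rank_bits                      # register index
        tail = h & ((1 << self.rank_bits) - 1)       # remaining hash bits
        rank = self.rank_bits - tail.bit_length() + 1  # leading zeros + 1
        old = self.registers[j]
        if rank > old:
            # The sketch changes state: credit 1 / P(change) to the estimate
            # using the probability as it was BEFORE this update.
            self.estimate += 1.0 / self.change_prob
            self.registers[j] = rank
            self.change_prob += (2.0 ** -rank - 2.0 ** -old) / self.m


sketch = MartingaleLogLog()
for i in range(20000):
    sketch.add(f"item-{i}")
print(round(sketch.estimate))
```

Duplicates never change a register, so they never change the estimate. The running estimate depends on the order in which state changes occurred, which is why two such sketches cannot simply be merged.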
Abstract
Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially-deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment, and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable, but attractive for other reasons. They are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP). E.g., a sketch that occupies $5m$ bits and whose relative variance is $2/m$ (standard error $\sqrt{2/m}$) has an MVP of $10$. Our contributions are as follows. Cohen and Ting independently discovered what we call…
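Since the MVP is the product of memory (in bits) and relative variance, it can be turned around to give the memory implied by a target standard error. The short Python sketch below does that arithmetic for the two MVP values quoted in the findings; the function name `bits_for_standard_error` is hypothetical, and the calculation assumes the limiting MVP applies exactly.

```python
def bits_for_standard_error(mvp: float, standard_error: float) -> float:
    """Bits of memory implied by MVP = bits * relative_variance."""
    relative_variance = standard_error ** 2
    return mvp / relative_variance


# MVP values taken from the findings above; labels are for display only.
for label, mvp in [("MVP ~1.63", 1.63), ("MVP ~2.31", 2.31)]:
    bits = bits_for_standard_error(mvp, 0.01)
    print(f"{label}: {bits:.0f} bits, about {bits / 8:.0f} bytes")
```

At a 1% standard error, an MVP of 1.63 corresponds to about 16,300 bits (roughly 2 kilobytes), and an MVP of 2.31 to about 23,100 bits.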
