Cardinality estimation using Gumbel distribution

Aleksander Łukasiewicz; Przemysław Uznański

arXiv:2008.07590 · cs.DS · August 19, 2020

TL;DR

This paper introduces a Gumbel distribution-based modification to LogLog and HyperLogLog algorithms, simplifying their analysis and improving estimator smoothness for cardinality estimation in large datasets.

Contribution

It proposes replacing the discrete geometric distribution used in LogLog and HyperLogLog with a continuous Gumbel distribution, which simplifies the analysis and yields smoother estimators.

Findings

01. Simpler, more elementary analysis of estimators

02. Smoother estimator behavior

03. Potential improvements in estimation accuracy

Abstract

Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (cf. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math Theor. 2007]) are small-space sketching schemes for cardinality estimation, which both have strong theoretical performance guarantees and are highly effective in practice. This makes them a highly popular solution with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite their simple and elegant formulation, the analyses of both LogLog and HyperLogLog are extremely involved, spanning tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces discrete geometric distribution with a continuous Gumbel…
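To make the idea of Gumbel-valued registers concrete, here is a minimal Python sketch. It is not the paper's algorithm (the abstract above is truncated before the details), only an illustration under stated assumptions: every element draws one hashed standard Gumbel value per register, each register keeps the maximum it has seen, and the hash is treated as ideal. The names `gumbel_hash` and `GumbelSketch` are hypothetical. The estimator relies on two standard facts: the maximum of n independent standard Gumbel values is Gumbel with location ln n, and exp(-X) for such an X is exponential with rate n.

```python
import hashlib
import math


def gumbel_hash(item: str, register: int) -> float:
    """Deterministically map (item, register) to a standard Gumbel value."""
    digest = hashlib.blake2b(f"{register}:{item}".encode(), digest_size=8).digest()
    # Keep 52 bits so that (k + 0.5) / 2**52 is exact and lies strictly in (0, 1).
    k = int.from_bytes(digest, "big") >> 12
    u = (k + 0.5) / 2**52
    # Inverse CDF of the standard Gumbel distribution.
    return -math.log(-math.log(u))


class GumbelSketch:
    """Illustrative sketch: each register stores the maximum Gumbel value seen."""

    def __init__(self, m: int = 256) -> None:
        self.registers = [-math.inf] * m

    def add(self, item: str) -> None:
        # Assumption: one hash per register (O(m) per element). Practical
        # sketches usually route each element to a single register instead.
        for j in range(len(self.registers)):
            g = gumbel_hash(item, j)
            if g > self.registers[j]:
                self.registers[j] = g

    def estimate(self) -> float:
        # With n distinct items, exp(-register) is Exponential(rate n), so the
        # sum over m registers is Gamma(m, n) and (m - 1) / sum has mean n.
        m = len(self.registers)
        total = sum(math.exp(-x) for x in self.registers)
        return (m - 1) / total


if __name__ == "__main__":
    sketch = GumbelSketch(m=256)
    for i in range(1000):
        sketch.add(f"user-{i}")
        sketch.add(f"user-{i}")  # duplicates leave the registers unchanged
    print(round(sketch.estimate()))
```

Because the registers hold continuous values, the estimate varies smoothly with the true cardinality instead of moving in the discrete steps of integer-valued registers. The trade-off in this simplified version is that each register stores a float and each update touches all m registers.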
