Cardinality Estimation Meets Good-Turing
Reuven Cohen, Liran Katzir, Aviv Yehezkel

TL;DR
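This paper introduces a generic algorithm that combines any cardinality estimator with stream sampling. It shows that sampling preserves the estimator's asymptotic unbiasedness and analyzes the effect of sampling on the estimator's variance, so that large data streams can be handled by processing only a small part of their elements.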
Contribution
It proposes a generic sampling method that can be combined with existing cardinality estimators, shows that the combination remains asymptotically unbiased, and analyzes how sampling changes the estimator's variance.
Findings
Sampling does not affect asymptotic unbiasedness.
The effect of sampling on the estimator's variance is characterized analytically.
The method enables scalable cardinality estimation on large streams.
Abstract
Cardinality estimation algorithms receive a stream of elements whose order might be arbitrary, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of inaccuracy in their output. Real-world applications of these algorithms are required to process large volumes of monitored data, making it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining any cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator's asymptotic unbiasedness, and we analyze the sampling effect on the estimator's variance.
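The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a sampling step and a cardinality estimator might be combined with a Good-Turing correction, as the title suggests. Everything in it is an assumption made for illustration: the function names, the use of Bernoulli sampling, the exact distinct count standing in for a real cardinality estimator, and the scaling step, which treats the Good-Turing unseen mass as the fraction of distinct elements missing from the sample (roughly reasonable only when elements occur with similar frequencies).

```python
import random
from collections import Counter


def sample_stream(stream, rate, rng):
    """Bernoulli sampling: keep each element independently with probability `rate`."""
    return [x for x in stream if rng.random() < rate]


def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the probability that the next stream element
    does not appear in the sample: (# values seen exactly once) / (sample size)."""
    if not sample:
        return 1.0
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def estimate_cardinality(stream, rate, rng, distinct_estimator=lambda s: len(set(s))):
    """Hypothetical combination step (an assumption, not taken from the paper):
    estimate the distinct count of the sample, then scale it up by 1 / (1 - unseen mass).
    `distinct_estimator` is a stand-in; any cardinality estimator could be plugged in."""
    sample = sample_stream(stream, rate, rng)
    n_sample = distinct_estimator(sample)
    p_unseen = good_turing_unseen_mass(sample)
    if p_unseen >= 1.0:
        # Every sampled value is a singleton: the sample gives no basis for scaling.
        return float("inf")
    return n_sample / (1.0 - p_unseen)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stream: 1000 distinct values, each repeated 20 times, in random order.
    stream = [i for i in range(1000) for _ in range(20)]
    rng.shuffle(stream)
    print(estimate_cardinality(stream, rate=0.1, rng=rng))
```

In this sketch only about a tenth of the stream is inspected, and the singleton count in the sample supplies the correction for distinct elements that sampling missed. Replacing the default `distinct_estimator` with a sketch-based estimator would trade exactness on the sample for lower memory use.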
