MaxSketch: Robust Distinct Counting in Streams via Random Projections
Nikos Tsikouras, Constantine Caramanis, Christos Tzamos

TL;DR
MaxSketch introduces a simple, Gaussian projection-based sketch that efficiently estimates the number of distinct high-dimensional, noisy objects in data streams by leveraging geometric structure, outperforming classical methods.
Contribution
The paper presents MaxSketch, a novel max-linear sketch that exploits geometric structure in learned representations to significantly improve memory efficiency for distinct counting.
Findings
MaxSketch accurately estimates distinct counts in image streams.
For (1+epsilon) accuracy, it requires a number of projections that is only logarithmic in n and proportional to 1/epsilon^2.
Experiments show MaxSketch outperforms classical methods and generalizes beyond training data.
Abstract
Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar -- for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the…
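As a rough illustration of what a max-linear sketch built from random Gaussian projections can look like, the Python sketch below keeps, for each of k random directions, the running maximum of the inner product with the stream items. This is an assumption about the general shape of such a sketch, not the authors' implementation: the function names (make_projections, update, merge), the dimensions, and the seeds are hypothetical. The step that turns the stored maxima into a distinct-count estimate is not described in this summary, so it is left out.

```python
import math
import random


def make_projections(k, d, seed=0):
    """Draw k random directions in R^d with i.i.d. N(0, 1) entries (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def empty_sketch(k):
    """A sketch is k running maxima, all starting at -infinity."""
    return [float("-inf")] * k


def update(sketch, projections, x):
    """Fold one stream item x into the sketch: coordinate-wise max of <g_j, x>."""
    for j, g in enumerate(projections):
        p = sum(gi * xi for gi, xi in zip(g, x))
        if p > sketch[j]:
            sketch[j] = p
    return sketch


def merge(a, b):
    """Sketch of the union of two streams: coordinate-wise max."""
    return [max(u, v) for u, v in zip(a, b)]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


# Toy stream of 20 random vectors in R^16, sketched with k = 8 projections.
rng = random.Random(1)
d, k = 16, 8
G = make_projections(k, d)
stream = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(20)]

full = empty_sketch(k)
for x in stream:
    update(full, G, x)

# Sketching two halves separately and merging gives the same sketch.
left, right = empty_sketch(k), empty_sketch(k)
for x in stream[:10]:
    update(left, G, x)
for x in stream[10:]:
    update(right, G, x)
merged = merge(left, right)

# An exact repeat of an earlier item leaves the sketch unchanged.
with_repeat = update(list(full), G, stream[0])

# A noisy repeat can move coordinate j by at most ||g_j|| * ||noise|| (Cauchy-Schwarz).
noise = [rng.gauss(0.0, 0.01) for _ in range(d)]
noisy_item = [xi + ni for xi, ni in zip(stream[0], noise)]
with_noisy = update(list(full), G, noisy_item)
max_shift = max(b - a for a, b in zip(full, with_noisy))
shift_bound = max(norm(g) for g in G) * norm(noise)
```

Two properties are visible here. Exact repeats never change the sketch, and a noisy repeat can shift each stored maximum by no more than the noise norm times the projection norm. That is unlike a hash-based sketch, where any pixel-level change produces an unrelated hash value.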
