Hashing for Sampling-Based Estimation
Anders Aamand, Ioana O. Bercea, Jakob Bæk Tejs Houen, Jonas Klausen, Mikkel Thorup

TL;DR
This paper establishes strong concentration bounds for Tornado Tabulation hashing, enhancing the reliability of sampling-based estimation methods like Jaccard similarity in large-scale set comparisons.
Contribution
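It provides strong explicit concentration bounds for Tornado Tabulation hashing. Previous concentration bounds for fast hashing were off by orders of magnitude in the sample size they required, so the new bounds allow reliable sampling-based estimation from much smaller samples.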
Findings
Derived explicit concentration bounds for Tornado Tabulation hashing
Improved sample size requirements for reliable estimation
Enhanced accuracy in set similarity comparisons
Abstract
Hash-based sampling and estimation are common themes in computing. Using hashing for sampling gives us the coordination needed to compare samples from different sets. Hashing is also used when we want to count distinct elements. The quality of the estimator for, say, the Jaccard similarity between two sets depends on the concentration of the number of sampled elements from their intersection. Often we want to compare one query set against many stored sets to find one of the most similar sets, so we need strong concentration and low error probability. In this paper, we provide strong explicit concentration bounds for Tornado Tabulation hashing [Bercea, Beretta, Klausen, Houen, and Thorup, FOCS'23], which is a realistic constant-time hashing scheme. Previous concentration bounds for fast hashing were off by orders of magnitude in the sample size needed to guarantee the same…
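The coordination the abstract describes can be illustrated with a minimal Python sketch. It is not the paper's method: it uses keyed BLAKE2b from the standard library as a stand-in hash rather than Tornado Tabulation, and the helper names (`hash_unit`, `sample`, `estimate_jaccard`), the sampling probability `p`, and the example sets are all assumptions chosen for illustration. Each set is sampled by keeping the keys whose hash value falls below a threshold, so the same key is either sampled in every set that contains it or in none.

```python
import hashlib


def hash_unit(key: str, seed: bytes = b"demo") -> float:
    """Map a key to a roughly uniform value in the unit interval.

    Stand-in hash (keyed BLAKE2b), NOT Tornado Tabulation hashing.
    """
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, key=seed).digest()
    return int.from_bytes(digest, "big") / 2**64


def sample(items, p: float, seed: bytes = b"demo") -> set:
    """Keep every item whose hash value is below the threshold p."""
    return {x for x in items if hash_unit(x, seed) < p}


def estimate_jaccard(sample_a: set, sample_b: set):
    """Estimate |A ∩ B| / |A ∪ B| from two coordinated samples."""
    union = sample_a | sample_b
    if not union:
        return None  # no sampled elements, so no estimate is possible
    return len(sample_a & sample_b) / len(union)


# Hypothetical example sets with true Jaccard similarity 3000 / 9000 = 1/3.
A = {f"item{i}" for i in range(0, 6000)}
B = {f"item{i}" for i in range(3000, 9000)}
p = 0.1  # assumed sampling probability

S_A = sample(A, p)
S_B = sample(B, p)

true_j = len(A & B) / len(A | B)
est_j = estimate_jaccard(S_A, S_B)
est_size_A = len(S_A) / p  # scaled sample size estimates |A|

print(f"true Jaccard = {true_j:.3f}, estimated = {est_j:.3f}")
print(f"estimated |A| = {est_size_A:.0f}")
```

Because both sets are sampled with the same hash function, `S_A & S_B` is exactly the sample of `A & B`. The accuracy of the estimate therefore depends on how tightly the number of sampled intersection elements concentrates around its mean, which is the quantity the paper's bounds address.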
Taxonomy
Topics: Advanced Data Compression Techniques · Face and Expression Recognition · Algorithms and Data Compression
