Hashing for Sampling-Based Estimation

Anders Aamand; Ioana O. Bercea; Jakob Bæk Tejs Houen; Jonas Klausen; Mikkel Thorup

arXiv:2411.19394·cs.DS·December 2, 2024

TL;DR

This paper establishes strong explicit concentration bounds for Tornado Tabulation hashing, making sampling-based estimators, such as those for the Jaccard similarity between sets, more reliable in large-scale set comparisons.

Contribution

It provides strong explicit concentration bounds for Tornado Tabulation hashing, a realistic constant-time hashing scheme. Previous concentration bounds for fast hashing were off by orders of magnitude in the sample size they required.

Findings

01 Derived explicit concentration bounds for Tornado Tabulation hashing

02 Improved sample size requirements for reliable estimation

03 Enhanced accuracy in set similarity comparisons

Abstract

Hash-based sampling and estimation are common themes in computing. Using hashing for sampling gives us the coordination needed to compare samples from different sets. Hashing is also used when we want to count distinct elements. The quality of the estimator for, say, the Jaccard similarity between two sets, depends on the concentration of the number of sampled elements from their intersection. Often we want to compare one query set against many stored sets to find one of the most similar sets, so we need strong concentration and low error-probability. In this paper, we provide strong explicit concentration bounds for Tornado Tabulation hashing [Bercea, Beretta, Klausen, Houen, and Thorup, FOCS'23] which is a realistic constant time hashing scheme. Previous concentration bounds for fast hashing were off by orders of magnitude, in the sample size needed to guarantee the same…
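To illustrate the coordination the abstract describes, here is a minimal Python sketch of hash-based threshold sampling used to estimate the Jaccard similarity of two sets. It is not the paper's construction: the hash `h` is a BLAKE2b stand-in rather than Tornado Tabulation, and the names `h`, `sample`, `estimate_jaccard`, and the sampling probability `p` are assumptions introduced only for this example.

```python
import hashlib
from typing import Optional, Set


def h(x: str, seed: int = 0) -> int:
    """Stand-in 64-bit hash built on BLAKE2b (an assumption; NOT Tornado Tabulation)."""
    digest = hashlib.blake2b(
        x.encode("utf-8"),
        digest_size=8,                    # 8 bytes = 64-bit hash value
        salt=seed.to_bytes(8, "little"),  # the seed selects the hash function
    ).digest()
    return int.from_bytes(digest, "little")


def sample(s: Set[str], p: float, seed: int = 0) -> Set[str]:
    """Keep x exactly when its hash falls below the threshold p * 2^64."""
    threshold = int(p * 2**64)
    return {x for x in s if h(x, seed) < threshold}


def estimate_jaccard(a: Set[str], b: Set[str], p: float, seed: int = 0) -> Optional[float]:
    """Estimate |a & b| / |a | b| from coordinated samples of a and b."""
    sa, sb = sample(a, p, seed), sample(b, p, seed)
    union = sa | sb
    if not union:
        return None  # nothing was sampled, so no estimate is available
    return len(sa & sb) / len(union)
```

Because both sets are sampled with the same hash and threshold, an element of the intersection is either in both samples or in neither, so `sample(a, p) & sample(b, p)` equals `sample(a & b, p)`. The quality of the estimate then depends on how concentrated the number of sampled intersection elements is, which is the quantity the paper's bounds address.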

Taxonomy

Topics: Advanced Data Compression Techniques · Face and Expression Recognition · Algorithms and Data Compression