Hashing for statistics over k-partitions
Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Eva Rotenberg, Mikkel Thorup

TL;DR
This paper demonstrates that mixed tabulation hashing provides strong concentration bounds for k-partition algorithms, similar to truly random hashing, enabling reliable statistical analysis in data streams and large-scale machine learning.
Contribution
It proves that mixed tabulation hashing achieves strong concentration bounds for k-partitioning, addressing correlation issues in popular hash functions and extending the analysis of probabilistic algorithms.
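For readers unfamiliar with the construction, here is a minimal sketch of mixed tabulation as it is usually defined: a simple tabulation function first maps the key's characters to a few "derived" characters, and a second simple tabulation function is then applied to the key characters followed by the derived ones. The 8-bit character size, the defaults `c=4` and `d=4`, the function names, and the use of Python's `random` module to fill the tables are all assumptions made for illustration and are not taken from this page.

```python
import random


def make_simple_tabulation(num_chars, out_bits, rng):
    """Simple tabulation: XOR one random table entry per 8-bit character."""
    tables = [[rng.getrandbits(out_bits) for _ in range(256)]
              for _ in range(num_chars)]

    def h(chars):
        acc = 0
        for table, ch in zip(tables, chars):
            acc ^= table[ch]
        return acc

    return h


def make_mixed_tabulation(c=4, d=4, out_bits=32, seed=0):
    """Hypothetical helper: builds h(x) = h1(x . h2(x)) over 8-bit characters."""
    rng = random.Random(seed)  # illustrative table filling only
    h2 = make_simple_tabulation(c, 8 * d, rng)         # key -> d derived chars
    h1 = make_simple_tabulation(c + d, out_bits, rng)  # key + derived -> hash

    def h(x):
        key_chars = [(x >> (8 * i)) & 0xFF for i in range(c)]
        derived = h2(key_chars)
        derived_chars = [(derived >> (8 * i)) & 0xFF for i in range(d)]
        return h1(key_chars + derived_chars)

    return h
```

Calling `make_mixed_tabulation(seed=1)` returns a function on 32-bit integer keys; each evaluation costs `c + d` table lookups and XORs.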
Findings
Mixed tabulation yields concentration bounds similar to random hashing.
Analysis applies to HyperLogLog and set similarity estimation.
New results of independent interest include constructions for invertible Bloom filters and uniform hashing.
Abstract
In this paper we analyze a hash function for k-partitioning a set into bins, obtaining strong concentration bounds for standard algorithms combining statistics from each bin. This generic method was originally introduced by Flajolet and Martin [FOCS'83] in order to save a factor Ω(k) of time per element over k independent samples when estimating the number of distinct elements in a data stream. It was also used in the widely used HyperLogLog algorithm of Flajolet et al. [AOFA'97] and in large-scale machine learning by Li et al. [NIPS'12] for minwise estimation of set similarity. The main issue of k-partition is that the contents of different bins may be highly correlated when using popular hash functions. This means that methods of analyzing the marginal distribution for a single bin do not apply. Here we show that a tabulation based hash function, mixed tabulation,…
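To make the k-partition idea concrete, the following is a small sketch of minwise set-similarity estimation with a single hash evaluation per element: the hash value picks one of k bins, and each bin keeps the smallest remaining value it has seen. The names `stand_in_hash`, `k_partition_sketch`, and `estimate_jaccard` are hypothetical, and the hash here is a seeded BLAKE2b digest used purely as a stand-in; it is not mixed tabulation and carries none of the paper's guarantees. Skipping bins that are empty in both sketches is one common choice of estimator, assumed here for illustration.

```python
import hashlib


def stand_in_hash(x, seed=0):
    """Placeholder 64-bit hash (assumption: NOT the paper's hash function)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def k_partition_sketch(items, k, seed=0):
    """One hash per element: it selects a bin and a value within that bin."""
    sketch = [None] * k
    for x in items:
        v = stand_in_hash(x, seed)
        bin_index, value = v % k, v // k
        if sketch[bin_index] is None or value < sketch[bin_index]:
            sketch[bin_index] = value  # keep the minimum per bin
    return sketch


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching bins, ignoring bins empty in both sketches."""
    matches = used = 0
    for a, b in zip(sketch_a, sketch_b):
        if a is None and b is None:
            continue
        used += 1
        matches += a == b
    return matches / used if used else 0.0
```

For example, sketching `range(0, 1000)` and `range(500, 1500)` with `k = 256` and the same seed should give an estimate near the true Jaccard similarity of 1/3, at the cost of one hash per element instead of k.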
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms
