# Fast hashing with Strong Concentration Bounds

**Authors:** Anders Aamand, Jakob B. T. Knudsen, Mathias B. T. Knudsen, Peter M. R., Rasmussen, Mikkel Thorup

arXiv: 1905.00369 · 2020-08-11

## TL;DR

This paper introduces a new hashing technique called tabulation-permutation hashing that provides strong concentration bounds for hash-based sums, even when expectations exceed previous limits, improving analysis of simple tabulation hashing.

## Contribution

The paper develops a new analysis and a novel hashing scheme, tabulation-permutation hashing, that achieves Chernoff-style concentration bounds beyond previous restrictions.

## Key findings

- Tabulation-permutation hashing is at most twice as slow as simple tabulation.
- It offers Chernoff-style concentration bounds for expectations larger than previous limits.
- The new method improves analysis of hash-based sums in large data sets.

## Abstract

Previous work on tabulation hashing by Patrascu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of $c=O(1)$ characters, e.g., a 64-bit key as $c=8$ characters of 8-bits. The character domain $\Sigma$ should be small enough that character tables of size $|\Sigma|$ fit in fast cache. The schemes then use $O(1)$ tables of this size, so the space of tabulation hashing is $O(|\Sigma|)$. However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are $\ll |\Sigma|$.   To see the problem, consider the very simple case where we use tabulation hashing to throw $n$ balls into $m$ bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if $n=m$, for then the expected value is $1$. However, if $m=2$, as when tossing $n$ unbiased coins, the expected value $n/2$ is $\gg |\Sigma|$ for large data sets, e.g., data sets that do not fit in fast cache.   To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call \emph{tabulation-permutation} hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1905.00369/full.md

---
Source: https://tomesphere.com/paper/1905.00369