Simple Set Sketching

Jakob Bæk Tejs Houen; Rasmus Pagh; Stefan Walzer

arXiv:2211.03683 · cs.DS · November 8, 2022

TL;DR

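This paper introduces a simple set sketching method: each hash table cell stores the bitwise XOR of the keys hashing to it, and this scheme is repeated three times independently. Below a load factor of ≈ 0.81, all inserted keys can be recovered in linear time with high probability. The approach is a variant of the Invertible Bloom Filter, and its analysis is nontrivial despite the simple description.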

Contribution

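The paper presents a collision resolution technique that stores the bitwise XOR of colliding keys, repeats this three times independently, and uses quotienting. It allows linear-time recovery of all inserted keys with high probability when the load factor is below ≈ 0.81.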

Findings

01

With three independent repetitions, all inserted keys can be recovered in linear time with high probability if the load factor is below a threshold of ≈ 0.81.

02

The approach is a variant of the Invertible Bloom Filter (IBF) of Eppstein and Goodrich that relies on implicit checksums rather than an explicit checksum per bucket.

03

Although the data structure is simple to describe, its analysis is nontrivial.

Abstract

Imagine handling collisions in a hash table by storing, in each cell, the bit-wise exclusive-or of the set of keys hashing there. This appears to be a terrible idea: For $\alpha n$ keys and $n$ buckets, where $\alpha$ is constant, we expect that a constant fraction of the keys will be unrecoverable due to collisions. We show that if this collision resolution strategy is repeated three times independently, the situation reverses: If $\alpha$ is below a threshold of $\approx 0.81$ then we can recover the set of all inserted keys in linear time with high probability. Even though the description of our data structure is simple, its analysis is nontrivial. Our approach can be seen as a variant of the Invertible Bloom Filter (IBF) of Eppstein and Goodrich. While IBFs involve an explicit checksum per bucket to decide whether the bucket stores a single key, we exploit the idea of…
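
The contrast the abstract draws between one XOR table and three can be illustrated with a minimal Python sketch. It is not the authors' data structure. To decide whether a cell holds a single key, it uses an explicit per-cell counter, a classic IBF-style stand-in that the paper specifically avoids. The names `cell_of`, `XorSketch`, and `single_table_recoverable`, the BLAKE2b-based hashing, and the layout of three tables of `m` cells each are all assumptions made for illustration.

```python
import hashlib
import random

NUM_TABLES = 3  # "repeated three times independently"


def cell_of(key: int, table: int, m: int) -> int:
    """Hypothetical hash: map a 64-bit key to a cell of the given table."""
    digest = hashlib.blake2b(
        key.to_bytes(8, "little"), digest_size=8, salt=bytes([table])
    ).digest()
    return int.from_bytes(digest, "little") % m


class XorSketch:
    def __init__(self, m: int):
        self.m = m
        self.xor = [[0] * m for _ in range(NUM_TABLES)]
        # Assumption: explicit counters stand in for the paper's
        # checksum-free test of whether a cell holds exactly one key.
        self.count = [[0] * m for _ in range(NUM_TABLES)]

    def insert(self, key: int) -> None:
        for t in range(NUM_TABLES):
            i = cell_of(key, t, self.m)
            self.xor[t][i] ^= key
            self.count[t][i] += 1

    def recover(self):
        """Peel single-key cells until none remain. Empties the sketch."""
        stack = [
            (t, i)
            for t in range(NUM_TABLES)
            for i in range(self.m)
            if self.count[t][i] == 1
        ]
        found = set()
        while stack:
            t, i = stack.pop()
            if self.count[t][i] != 1:
                continue  # cell changed since it was pushed
            key = self.xor[t][i]
            found.add(key)
            # Remove the key from all of its cells, exposing new singletons.
            for u in range(NUM_TABLES):
                j = cell_of(key, u, self.m)
                self.xor[u][j] ^= key
                self.count[u][j] -= 1
                if self.count[u][j] == 1:
                    stack.append((u, j))
        complete = all(c == 0 for row in self.count for c in row)
        return found, complete


def single_table_recoverable(keys, n: int) -> int:
    """Count keys that sit alone in their cell of one table with n cells."""
    counts = [0] * n
    for k in keys:
        counts[cell_of(k, 0, n)] += 1
    return sum(1 for k in keys if counts[cell_of(k, 0, n)] == 1)


keys = random.Random(1).sample(range(1, 2**62), 1500)
sketch = XorSketch(m=1000)  # 3000 cells in total, load factor 0.5
for k in keys:
    sketch.insert(k)
found, complete = sketch.recover()
print(len(found), complete, single_table_recoverable(keys, 3000))
```

With 1500 keys in 3000 cells the load factor is 0.5, below the ≈ 0.81 threshold, so peeling is expected to empty the sketch, whereas a single table of the same total size leaves a constant fraction of keys stuck in shared cells.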

Taxonomy

Topics: Caching and Content Delivery · DNA and Biological Computing · Algorithms and Data Compression