TL;DR
HyperMinHash is a compact MinHash sketch that stores each bucket in floating-point notation. It significantly reduces space requirements while retaining MinHash's streaming updates and union operations, which makes Jaccard similarity estimation practical for very large sets.
Contribution
Presents HyperMinHash, a lossy compression of MinHash that encodes each bucket in floating-point notation, giving sub-logarithmic bucket sizes while preserving MinHash's streaming updates, unions, and cardinality estimation.
Findings
Estimates Jaccard indices of 0.01 for set cardinalities on the order of 10^{19}.
Uses around 64KiB of memory to achieve a relative error of around 10%.
Handles much larger set cardinalities than traditional MinHash at a comparable memory budget.
Abstract
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For a multiplicative approximation error 1+ε on a Jaccard index t, given a random oracle, HyperMinHash needs O(ε^{-2}(log log n + log(1/(tε)))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^{19} with relative error of around 10%…
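The floating-point bucket encoding described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The parameters `P`, `Q`, `R`, the class name `HyperMinHashSketch`, and the helper names are hypothetical choices made for illustration. SHA-256 stands in for the random oracle, and the Jaccard estimate is a plain bucket-match ratio with no correction for accidental register collisions.

```python
import hashlib

# Hypothetical parameters (assumptions, not values taken from the paper):
P = 10           # 2**P buckets, as in a HyperLogLog-style scaffold
Q = 6            # exponent field width: leading-zero count capped at 2**Q - 1
R = 10           # mantissa field width: bits kept after the leading one
HASH_BITS = 128  # number of hash bits consumed per item


def _encode(item: str):
    """Map an item to (bucket, (exponent, mantissa))."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h = int.from_bytes(digest[:16], "big")   # SHA-256 stands in for a random oracle
    rest_bits = HASH_BITS - P
    bucket = h >> rest_bits                   # first P bits select the bucket
    rest = h & ((1 << rest_bits) - 1)
    max_exp = (1 << Q) - 1
    leading_zeros = rest_bits - rest.bit_length()
    if leading_zeros < max_exp:
        exponent = leading_zeros
        shift = rest_bits - exponent - 1 - R  # skip the zeros and the leading one
    else:
        exponent = max_exp                    # cap: exponent field is only Q bits wide
        shift = rest_bits - max_exp - R       # keep the R bits after the capped prefix
    mantissa = (rest >> shift) & ((1 << R) - 1)
    return bucket, (exponent, mantissa)


def _order_key(register):
    """Smaller key means smaller underlying hash: more leading zeros, then smaller mantissa."""
    exponent, mantissa = register
    return (-exponent, mantissa)


class HyperMinHashSketch:
    def __init__(self):
        self.registers = [None] * (1 << P)    # None marks an empty bucket

    def add(self, item: str) -> None:
        """Streaming update: keep the compressed minimum hash per bucket."""
        bucket, register = _encode(item)
        current = self.registers[bucket]
        if current is None or _order_key(register) < _order_key(current):
            self.registers[bucket] = register

    def union(self, other: "HyperMinHashSketch") -> "HyperMinHashSketch":
        """Bucket-wise minimum; equals the sketch of the union of the two sets."""
        merged = HyperMinHashSketch()
        for i, pair in enumerate(zip(self.registers, other.registers)):
            filled = [r for r in pair if r is not None]
            merged.registers[i] = min(filled, key=_order_key) if filled else None
        return merged

    def jaccard(self, other: "HyperMinHashSketch") -> float:
        """Uncorrected estimate: matching buckets over non-empty buckets."""
        matches = filled = 0
        for a, b in zip(self.registers, other.registers):
            if a is None and b is None:
                continue
            filled += 1
            if a == b:
                matches += 1
        return matches / filled if filled else 0.0


def sketch_of(items) -> HyperMinHashSketch:
    s = HyperMinHashSketch()
    for item in items:
        s.add(item)
    return s
```

With these assumed parameters each register needs Q + R = 16 bits regardless of how many hash bits are consumed, which is the point of the encoding: the exponent grows like log log n instead of storing the full minimum hash. For example, `sketch_of(map(str, range(20000))).jaccard(sketch_of(map(str, range(10000, 30000))))` should land near the true Jaccard index of 1/3, up to sampling error from the 1024 buckets.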
