TL;DR
SetSketch is a new data structure that bridges the gap between MinHash and HyperLogLog, enabling efficient, accurate set similarity estimation and cardinality counting in distributed big data applications.
Contribution
Introduces SetSketch, a mergeable sketch data structure that interpolates between MinHash and HyperLogLog, together with estimators for cardinality and for joint quantities such as the Jaccard similarity.
Findings
SetSketch enables fast, robust, and easy-to-implement estimators for cardinality and for joint quantities such as the Jaccard similarity.
The joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash.
Applied to those structures, it performs better than the corresponding state-of-the-art estimators in many cases.
Abstract
MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs…
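The properties the abstract highlights (a commutative and idempotent insert, a mergeable state, and Jaccard estimation from the sketch alone) can be illustrated with a plain MinHash-style toy. This is not the SetSketch algorithm and none of it comes from the paper: the class name `ToyMinHash`, the register count `m`, and the use of BLAKE2b as the hash function are assumptions chosen only to keep the sketch short and self-contained.

```python
import hashlib

_EMPTY = 2**64  # sentinel larger than any 64-bit hash value


class ToyMinHash:
    """Toy MinHash with m registers (illustrative only, NOT SetSketch)."""

    def __init__(self, m=128):
        self.m = m
        self.registers = [_EMPTY] * m

    def _hash(self, item, i):
        # One 64-bit hash per (register index, item) pair.
        data = f"{i}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def insert(self, item):
        # Each register keeps the minimum hash seen so far. Taking a minimum
        # makes insertion order irrelevant (commutative) and makes repeated
        # insertion of the same item a no-op (idempotent).
        for i in range(self.m):
            h = self._hash(item, i)
            if h < self.registers[i]:
                self.registers[i] = h

    def merge(self, other):
        # The register-wise minimum equals the sketch of the union of both sets.
        if self.m != other.m:
            raise ValueError("sketches must have the same number of registers")
        merged = ToyMinHash(self.m)
        merged.registers = [min(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def jaccard(self, other):
        # The fraction of equal registers is an estimate of the Jaccard similarity.
        if self.m != other.m:
            raise ValueError("sketches must have the same number of registers")
        equal = sum(1 for a, b in zip(self.registers, other.registers) if a == b)
        return equal / self.m


if __name__ == "__main__":
    a, b = ToyMinHash(), ToyMinHash()
    for x in range(0, 300):
        a.insert(x)
    for x in range(150, 450):
        b.insert(x)
    # True Jaccard similarity is 150 / 450; the estimate is noisy for small m.
    print("estimated Jaccard:", a.jaccard(b))
```

Because every register update is a minimum, sketches built on different machines can be combined with `merge` in any order, which is the kind of behavior that makes such structures suitable for distributed environments.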
