TL;DR
BagMinHash is a minwise hashing algorithm for weighted sets that can be orders of magnitude faster than the prior state of the art, with no particular restrictions on weights or data dimensionality. Applied to unweighted sets, it is the first efficient algorithm that produces independent signature components.
Contribution
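The paper introduces BagMinHash, a minwise hashing algorithm that computes signatures for weighted sets without particular restrictions or assumptions on weights or data dimensionality. In the special case of unweighted sets, it is the first efficient algorithm producing independent signature components.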
Findings
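BagMinHash can be orders of magnitude faster than the previous state of the art for weighted sets.
It places no particular restrictions or assumptions on weights or data dimensionality.
For unweighted sets, it is the first efficient algorithm that produces independent signature components.
A series of tests verifies the algorithm and reveals limitations of other recently published approaches.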
Abstract
Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests finally verifies the new algorithm and also reveals limitations of other approaches published in the recent past.
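To make "signatures which allow direct estimation of Jaccard similarities" concrete, here is a minimal Python sketch. It is not BagMinHash and is not taken from the paper: it shows the exact weighted Jaccard similarity (sum of element-wise minima over sum of element-wise maxima) and the classic unweighted MinHash baseline that uses one hash function per signature component. All function names, the BLAKE2b-based hash, and the signature length `m` are assumptions chosen for illustration.

```python
import hashlib


def weighted_jaccard(a, b):
    """Exact weighted Jaccard similarity of two dicts mapping element -> non-negative weight."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Two empty (all-zero) inputs are treated as identical by convention here.
    return num / den if den > 0 else 1.0


def _hash(element, seed):
    """Hypothetical helper: 64-bit hash of an element under a given seed."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(elements, m=256):
    """Classic unweighted MinHash for a non-empty set.

    Component i is the minimum of hash function i over all elements,
    so the cost is O(m * n) hash evaluations for n elements.
    """
    return [min(_hash(e, i) for e in elements) for i in range(m)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching components, which estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    A = set(range(0, 100))
    B = set(range(50, 150))
    exact = weighted_jaccard({k: 1 for k in A}, {k: 1 for k in B})
    approx = estimate_jaccard(minhash_signature(A), minhash_signature(B))
    print(f"exact={exact:.3f} estimated={approx:.3f}")
```

With all weights equal to 1, the weighted Jaccard similarity reduces to the ordinary Jaccard similarity, which is why the same estimator applies to both cases. The baseline above gets independent components by hashing every element once per component, and that O(m * n) cost is the kind of expense the abstract refers to when it describes BagMinHash as the first efficient algorithm with independent signature components.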
