TL;DR
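This paper introduces ProbMinHash, a class of locality-sensitive hash algorithms that map weighted sets to compact signatures for estimating the probability Jaccard similarity. By calculating signature components collectively rather than independently, the algorithms run orders of magnitude faster than the original approach, and some variants also improve estimation accuracy.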
Contribution
It proposes four ProbMinHash algorithms for probability Jaccard similarity estimation, all faster than the original approach: two are statistically equivalent to it, and two additionally improve estimation accuracy.
Findings
Two algorithms are statistically equivalent to the original approach for the probability Jaccard similarity, but faster.
Two algorithms improve estimation accuracy by introducing statistical dependence between signature components.
Specializations for the conventional Jaccard similarity outperform traditional MinHash in efficiency.
Abstract
The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in…
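To make the contrast in the abstract concrete, here is a minimal Python sketch of the slow baseline, where every signature component is computed independently. It is not one of the four ProbMinHash algorithms. It assumes the baseline draws one exponential variate with rate equal to the element weight for each (element, component) pair and keeps the element with the smallest value. The function names, the SHA-256 based pseudo-random generator, and the example weights are all illustrative assumptions, not taken from the paper.

```python
import hashlib
import math


def _uniform(element, k):
    """Deterministic pseudo-random number in (0, 1) derived from (element, k).

    Hypothetical helper: any good hash seeded by element and component works.
    """
    digest = hashlib.sha256(f"{k}:{element}".encode()).digest()
    # Keep 52 bits and shift by half a step so the result is never 0 or 1.
    return ((int.from_bytes(digest[:8], "big") >> 12) + 0.5) / (1 << 52)


def signature(weights, m):
    """Baseline signature: each of the m components is computed independently.

    The cost is O(m * n) for n weighted elements, which is the overhead that
    collective computation of the components is meant to avoid.
    """
    sig = []
    for k in range(m):
        best, best_h = None, math.inf
        for element, w in weights.items():
            if w <= 0:
                continue
            # Exponential variate with rate w via inverse transform sampling.
            h = -math.log(_uniform(element, k)) / w
            if h < best_h:
                best, best_h = element, h
        sig.append(best)
    return sig


def estimate(sig_a, sig_b):
    """Fraction of equal components, used as the similarity estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def probability_jaccard(wa, wb):
    """Exact probability Jaccard similarity, for checking the estimate."""
    universe = wa.keys() | wb.keys()
    total = 0.0
    for d in wa.keys() & wb.keys():
        if wa[d] <= 0 or wb[d] <= 0:
            continue
        denom = sum(
            max(wa.get(e, 0.0) / wa[d], wb.get(e, 0.0) / wb[d]) for e in universe
        )
        total += 1.0 / denom
    return total


if __name__ == "__main__":
    wa = {"a": 0.5, "b": 0.3, "c": 0.2}
    wb = {"a": 0.4, "b": 0.1, "d": 0.5}
    m = 1024
    print("estimate:", estimate(signature(wa, m), signature(wb, m)))
    print("exact:   ", probability_jaccard(wa, wb))
```

Because both signatures use the same pseudo-random value for a given (element, component) pair, the fraction of matching components estimates the probability Jaccard similarity. The nested loop over components and elements is the cost that the collective approach targets.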
