Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

Ryan Moulton; Yunjiang Jiang

arXiv:1809.04052 · cs.DS · January 4, 2019

TL;DR

This paper presents simple, efficient algorithms for computing a MinHash of a probability distribution. Their collision probability defines a new similarity measure for positive vectors that generalizes the Jaccard index and is optimal, in a sense the paper describes, among Locality Sensitive Hashes based on sampling.

Contribution

The paper introduces simple, efficient MinHash algorithms for probability distributions, studies their collision probability as a similarity measure in its own right, and argues that this measure is the natural generalization of the Jaccard index.

Findings

01

The algorithms handle both sparse and dense data, with running times equivalent to the state of the art in each case.

02

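The collision probability of these algorithms is a new similarity measure for positive vectors, and the paper describes the sense in which it is optimal for any Locality Sensitive Hash based on sampling.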

03

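The authors argue that this measure is the natural generalization of the Jaccard index and is more useful for probability distributions than the similarity pursued by other weighted MinHash algorithms.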
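
One way to make the Jaccard connection concrete, under an assumption that this summary does not state: suppose every element i receives a shared unit-rate exponential variable E_i, and the sample for a positive vector x is the index minimising E_i / x_i. The symbols x, y, E_i, A, B, and J_P are notation introduced here for illustration only. Under that assumption the collision probability has a closed form, and it reduces to the ordinary Jaccard index when x and y are indicator vectors of sets A and B.

```latex
% Element i wins for both x and y exactly when, for every j,
%   E_j / \max(x_j/x_i,\; y_j/y_i) \ge E_i .
% A minimum of independent exponentials is attained by index i with
% probability (rate of i) / (sum of all rates), which gives:
\[
  J_P(x, y) \;=\; \sum_{i \,:\, x_i > 0,\; y_i > 0}
      \frac{1}{\sum_{j} \max\!\left( \dfrac{x_j}{x_i},\; \dfrac{y_j}{y_i} \right)} .
\]
% For x = \mathbf{1}_A and y = \mathbf{1}_B, every i \in A \cap B has
% denominator |A \cup B|, so
\[
  J_P(\mathbf{1}_A, \mathbf{1}_B) \;=\; \frac{|A \cap B|}{|A \cup B|} .
\]
```

The expression is unchanged when x or y is rescaled by a positive constant, so uniform distributions over A and B give the same value as the raw indicator vectors.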

Abstract

We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with equivalent running times to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index.
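
The idea of a MinHash for a distribution can be illustrated with an exponential race that reuses the same per-element randomness for every input. The following is a minimal sketch under that assumption, not the authors' implementation: the function names `consistent_sample` and `collision_probability`, the dictionary representation of vectors, and the string-seeded `random.Random` generator standing in for a hash function are all hypothetical choices made here.

```python
import math
import random


def consistent_sample(weights, seed):
    """Pick one key from a dict of positive weights.

    Each key gets a uniform value that depends only on (seed, key), so two
    different weight dicts share the same randomness. The key minimising
    -log(u) / weight is returned, which selects keys in proportion to weight.
    """
    best_key, best_value = None, math.inf
    for key, w in weights.items():
        if w <= 0:
            continue  # zero-weight keys can never be sampled
        u = random.Random(f"{seed}:{key}").random()  # stand-in for a hash
        value = -math.log(1.0 - u) / w  # 1 - u lies in (0, 1], so log is finite
        if value < best_value:
            best_key, best_value = key, value
    return best_key


def collision_probability(x, y):
    """Closed-form probability that consistent_sample agrees on x and y."""
    keys = set(x) | set(y)
    total = 0.0
    for i in keys:
        xi, yi = x.get(i, 0.0), y.get(i, 0.0)
        if xi <= 0 or yi <= 0:
            continue  # i must have positive weight in both vectors
        denom = sum(max(x.get(j, 0.0) / xi, y.get(j, 0.0) / yi) for j in keys)
        total += 1.0 / denom
    return total


def estimate_collision(x, y, trials):
    """Fraction of seeds on which the two samples coincide."""
    hits = sum(consistent_sample(x, s) == consistent_sample(y, s)
               for s in range(trials))
    return hits / trials
```

For two sets encoded as equal-weight dictionaries, `collision_probability` returns their Jaccard index, and `estimate_collision` should approach the same value as the number of trials grows.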

Code & Models

Repositories

jean-pierreBoth/probminhash