Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence

Mikkel Thorup

arXiv:1303.5479·cs.DS·June 12, 2013

TL;DR

This paper shows that bottom-k sampling estimates set similarities and subset sums with only a 2-independent hash function, with expected relative error within a constant factor of that of truly random hashing, and it extends the approach to weighted sets using priority sampling.

Contribution

The paper proves that bottom-k sampling gives accurate estimates with only 2-independent hashing, reducing the required independence from 8 to 2, and it extends the analysis to weighted sets via priority sampling.

Findings

01

Expected relative error is O(1/√(fk)) with 2-independence.

02

For fk = Ω(1), this error is within a constant factor of the expected relative error with truly random hashing.

03

Priority sampling effectively handles weighted sets with strong concentration bounds.
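
The summary above does not spell out how priority sampling works, so the following is only a minimal sketch of the standard scheme as it is usually described: each item i with weight w_i gets priority q_i = w_i / h(i) for a hash value h(i) in (0, 1], the k items of highest priority are kept, and each kept item is assigned the weight estimate max(w_i, τ), where τ is the (k+1)-th highest priority. The names `make_unit_hash`, `priority_sample`, and `estimate_subset_sum`, the prime modulus, and the integer keys are illustrative assumptions, not taken from the paper.

```python
import random

P = (1 << 61) - 1  # prime modulus (an assumption, not from the paper)

def make_unit_hash(seed):
    """2-independent hash x -> ((a*x + b) mod P + 1) / P, with values in (0, 1].

    Keys are assumed to be integers in [0, P).
    """
    rng = random.Random(seed)
    a, b = rng.randrange(P), rng.randrange(P)
    return lambda x: (((a * x + b) % P) + 1) / P

def priority_sample(weights, h, k):
    """weights: dict mapping key -> positive weight.

    Returns (estimates, tau): estimates maps each sampled key to
    max(weight, tau), where tau is the (k+1)-th highest priority
    (0.0 if there are at most k items).
    """
    # Sort by decreasing priority w_i / h(i); break ties by key.
    ranked = sorted(weights, key=lambda i: (-weights[i] / h(i), i))
    top = ranked[:k]
    tau = weights[ranked[k]] / h(ranked[k]) if len(ranked) > k else 0.0
    return {i: max(weights[i], tau) for i in top}, tau

def estimate_subset_sum(estimates, subset):
    """Estimate the total weight of `subset` from the sampled estimates."""
    return sum(w for i, w in estimates.items() if i in subset)

if __name__ == "__main__":
    h = make_unit_hash(7)
    weights = {i: 1 + (i % 10) for i in range(1000)}
    estimates, tau = priority_sample(weights, h, 50)
    subset = set(range(0, 1000, 2))
    print("estimated:", estimate_subset_sum(estimates, subset))
    print("exact:    ", sum(weights[i] for i in subset))
```

Items whose weight exceeds τ keep their exact weight, while lighter sampled items are rounded up to τ, which is how the sample adapts to heavy items.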

Abstract

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of k×min-wise where we use…
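
The estimators described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch rather than the paper's implementation: the function names, the multiply-add hash modulo a Mersenne prime (one standard 2-independent family), the integer keys, and the tie-breaking by key are all assumptions made here. It also assumes every set has at least k elements.

```python
import random

P = (1 << 61) - 1  # prime modulus (an assumption, not from the paper)

def make_hash(seed):
    """2-independent hash x -> (a*x + b) mod P for integer keys in [0, P)."""
    rng = random.Random(seed)
    a, b = rng.randrange(P), rng.randrange(P)
    return lambda x: (a * x + b) % P

def bottom_k(items, h, k):
    """S_k(X): the k elements with the smallest hash values (ties broken by key)."""
    return sorted(set(items), key=lambda x: (h(x), x))[:k]

def estimate_subset_fraction(sample, subset, k):
    """Estimate f = |Y|/|X| as |S_k(X) ∩ Y| / k."""
    return sum(1 for x in sample if x in subset) / k

def estimate_jaccard(sample_a, sample_b, h, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k samples of A and B."""
    sa, sb = set(sample_a), set(sample_b)
    # S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B))
    union_sample = bottom_k(sa | sb, h, k)
    hits = sum(1 for x in union_sample if x in sa and x in sb)
    return hits / k

if __name__ == "__main__":
    h = make_hash(1)
    A, B = set(range(0, 3000)), set(range(1000, 4000))  # true Jaccard = 0.5
    k = 200
    print(estimate_jaccard(bottom_k(A, h, k), bottom_k(B, h, k), h, k))
```

Both samples must be built with the same hash function h; the union sample is then recovered from the two stored samples without revisiting A or B.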

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms · Algorithms and Data Compression