Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
Mikkel Thorup

TL;DR
This paper demonstrates that bottom-k sampling can reliably estimate set similarities and subset sums with only 2-independent hash functions, achieving near-optimal error bounds, and extends the approach to weighted sets using priority sampling.
Contribution
It proves that bottom-k sampling maintains accurate estimations with minimal hash independence, reducing the required independence from 8 to 2, and extends the analysis to weighted sets with priority sampling.
Findings
Expected relative error is O(1/√(fk)) with 2-independence.
For fk = Ω(1), the expected relative error with 2-independent hashing is within a constant factor of the error with truly random hashing.
Priority sampling effectively handles weighted sets with strong concentration bounds.
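The summary above does not spell out how priority sampling works, so the following is a minimal sketch of the standard formulation, not the paper's own code. Each item with weight w gets priority w / h(x) for a hash value h(x) in (0, 1], the k items of highest priority are kept, and each kept item is assigned the weight estimate max(w, τ), where τ is the (k+1)-th highest priority. The names `make_unit_hash`, `priority_sample`, and `estimate_subset_sum`, the modulus `P`, and the restriction to integer keys in [0, P) are assumptions made for illustration.

```python
import random

# Mersenne prime used as the hash modulus (an assumption of this sketch).
P = (1 << 61) - 1

def make_unit_hash(seed):
    """Draw h(x) = (a*x + b) mod P from a 2-independent family, scaled into (0, 1]."""
    rng = random.Random(seed)
    a = rng.randrange(P)
    b = rng.randrange(P)
    # Adding 1 before scaling keeps the value strictly positive,
    # so the priority w / h(x) never divides by zero.
    return lambda x: (((a * x + b) % P) + 1) / P

def priority_sample(weights, h, k):
    """Map each sampled key to its weight estimate.

    `weights` is a dict from integer key to positive weight.
    """
    ranked = sorted(weights, key=lambda x: weights[x] / h(x), reverse=True)
    if len(ranked) <= k:
        # Everything fits in the sample, so the weights are exact.
        return dict(weights)
    # Threshold: the (k+1)-th highest priority.
    tau = weights[ranked[k]] / h(ranked[k])
    return {x: max(weights[x], tau) for x in ranked[:k]}

def estimate_subset_sum(sample, subset):
    """Estimate the total weight of `subset` from a priority sample."""
    return sum(w for x, w in sample.items() if x in subset)
```

Summing the estimates of the sampled items that fall in a subset gives the subset-sum estimate. Light items that make it into the sample are scaled up to τ to account for the light items that were dropped.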
Abstract
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f=|A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B)=S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk=Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of k×min-wise where we use…
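The estimator described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the helper names `make_hash`, `bottom_k`, and `estimate_jaccard` are hypothetical, keys are assumed to be integers in [0, P), and the 2-independent family is taken to be h(x) = (a·x + b) mod P for a prime P.

```python
import random

# Mersenne prime used as the hash modulus (an assumption of this sketch).
P = (1 << 61) - 1

def make_hash(seed):
    """Draw h(x) = (a*x + b) mod P from a 2-independent family."""
    rng = random.Random(seed)
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P

def bottom_k(items, h, k):
    """Return S_k(items): the k elements with the smallest hash values."""
    return set(sorted(set(items), key=h)[:k])

def estimate_jaccard(sample_a, sample_b, h, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k samples of A and B."""
    # S_k(A ∪ B) can be rebuilt from the two samples alone.
    union_sample = bottom_k(sample_a | sample_b, h, k)
    return len(union_sample & sample_a & sample_b) / k

# Usage: two overlapping integer ranges whose true Jaccard similarity is 0.5.
h = make_hash(seed=7)
A = set(range(0, 3000))
B = set(range(1000, 4000))
k = 200
estimate = estimate_jaccard(bottom_k(A, h, k), bottom_k(B, h, k), h, k)
```

Both samples must be built with the same hash function, since the union sample is only meaningful when all elements are ranked consistently. Only the two size-k samples are needed at estimation time, not the full sets.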
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms · Algorithms and Data Compression
