Consistent Weighted Sampling Made Fast, Small, and Easy
Bernhard Haeupler, Mark Manasse, Kunal Talwar

TL;DR
This paper introduces a fast, simple, and accurate method for weighted set sketching that reduces weighted sets to unweighted sets, enabling efficient similarity estimation with minimal bias and high accuracy.
Contribution
It presents a novel reduction technique for weighted sets to unweighted sets that improves computational efficiency while maintaining accuracy, applicable to existing sketching schemes.
Findings
Significant computational gains demonstrated empirically.
Bias introduced by the reduction is negligible.
Method maintains accuracy comparable to unweighted schemes.
Abstract
Document sketching using Jaccard similarity has been an effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression, and learning applications. Min-wise sampling can be used to derive an unbiased estimator for Jaccard similarity, and taking a few hundred independent consistent samples leads to compact sketches that provide good estimates of pairwise similarity. Subsequent works extended this technique to weighted sets and showed how to produce samples with only a constant number of hash evaluations for any element, independent of its weight. Another improvement by Li et al. shows how to speed up sketch computations by computing many (near-)independent samples in one shot. Unfortunately, this latter improvement works only for the unweighted case. In this paper we give a simple, fast…
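To make the min-wise sampling idea in the abstract concrete, here is a minimal sketch of the classic unweighted estimator only. It is not the paper's weighted-to-unweighted reduction, and it does not implement the one-shot speedup attributed to Li et al. The function names (`minhash_sketch`, `estimate_jaccard`, `exact_jaccard`) and the choice of seeded BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


def _hash(seed: int, element: str) -> int:
    """Seeded 64-bit hash; the seed selects one hash function of the family (an assumption)."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_samples: int = 256):
    """Return one min-wise sample per seed: the smallest hash value over the set."""
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_samples)]


def estimate_jaccard(sketch_a, sketch_b) -> float:
    """Fraction of sample positions on which the two sketches agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def exact_jaccard(a, b) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = {f"shingle{i}" for i in range(0, 150)}
    doc_b = {f"shingle{i}" for i in range(50, 200)}
    print("exact:   ", exact_jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b)))
```

Each sample position agrees with probability equal to the Jaccard similarity of the two sets, because both sets share the same minimum exactly when the minimum of their union lies in their intersection. Averaging over a few hundred seeds therefore gives an unbiased estimate. Note that this sketch hashes every element once per sample, which is the cost the one-shot techniques mentioned in the abstract aim to avoid.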
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression · Advanced Neural Network Applications
