Optimal Densification for Fast and Accurate Minwise Hashing
Anshumali Shrivastava

TL;DR
This paper introduces a variance-optimal densification scheme for minwise hashing that maintains high accuracy and efficiency, especially for sparse, high-dimensional data, outperforming existing methods.
Contribution
A novel densification method using tailored 2-universal hashes that achieves variance optimality without sacrificing runtime efficiency.
Findings
Significantly improved accuracy over existing densification techniques.
Maintains the same variance and collision probability as vanilla minwise hashing.
Validated on real sparse, high-dimensional datasets.
Abstract
Minwise hashing is a fundamental and one of the most successful hashing algorithms in the literature. Recent advances based on the idea of densification~\cite{Proc:OneHashLSH_ICML14,Proc:Shrivastava_UAI14} have shown that it is possible to compute $k$ minwise hashes, of a vector with $d$ nonzeros, in mere $(d + k)$ computations, a significant improvement over the classical $O(dk)$. These advances have led to an algorithmic improvement in the query complexity of traditional indexing algorithms based on minwise hashing. Unfortunately, the variance of the current densification techniques is unnecessarily high, which leads to significantly worse accuracy compared to vanilla minwise hashing, especially when the data is sparse. In this paper, we provide a novel densification scheme which relies on carefully tailored 2-universal hashes. We show that the proposed scheme is variance-optimal, and…
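To make the idea of densification concrete, the following is a minimal Python sketch, not the paper's reference implementation. It assumes set elements are integers in [0, PRIME), uses simple linear hashes modulo a Mersenne prime as stand-ins for the 2-universal hashes the abstract mentions, and all function names (`make_params`, `one_permutation_sketch`, `densify`, `estimate_jaccard`) are hypothetical. One pass over the d nonzeros fills k bins; each empty bin is then filled by probing other bins chosen by a hash of the bin index and an attempt counter, so two sets that share the same randomness probe the same sequence.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; items are assumed to be ints in [0, PRIME)


def make_params(seed):
    """Draw shared randomness: one hash for binning, one for probing."""
    rng = random.Random(seed)
    bin_hash = (rng.randrange(1, PRIME), rng.randrange(PRIME))
    probe_hash = (rng.randrange(1, PRIME), rng.randrange(1, PRIME),
                  rng.randrange(PRIME))
    return bin_hash, probe_hash


def one_permutation_sketch(items, k, bin_hash):
    """One pass over the nonzeros: keep the minimum hash value per bin."""
    a, b = bin_hash
    bins = [None] * k
    for x in items:
        v = (a * x + b) % PRIME
        j = v % k  # bin index for this element
        if bins[j] is None or v < bins[j]:
            bins[j] = v
    return bins


def densify(bins, probe_hash):
    """Fill each empty bin from a non-empty bin picked by hashed probing."""
    if all(v is None for v in bins):
        raise ValueError("cannot densify the sketch of an empty set")
    c, d, e = probe_hash
    k = len(bins)
    out = list(bins)
    for i in range(k):
        attempt = 1
        while out[i] is None:
            # Assumed stand-in for a 2-universal hash of (bin index, attempt).
            j = ((c * i + d * attempt + e) % PRIME) % k
            out[i] = bins[j]  # still None if bin j was empty; then retry
            attempt += 1
    return out


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions in two densified signatures."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    bin_hash, probe_hash = make_params(seed=0)
    k = 64
    set_a = set(range(0, 30))   # sparse relative to k, so many bins start empty
    set_b = set(range(10, 40))
    sig_a = densify(one_permutation_sketch(set_a, k, bin_hash), probe_hash)
    sig_b = densify(one_permutation_sketch(set_b, k, bin_hash), probe_hash)
    print("estimated Jaccard:", estimate_jaccard(sig_a, sig_b))
```

Both sets must be sketched with the same `bin_hash` and `probe_hash`; otherwise matching signature positions carry no information about set overlap. The sketch stores the full hash value in each bin, which keeps the example short at the cost of larger signatures.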
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Caching and Content Delivery · Algorithms and Data Compression
