b-Bit Minwise Hashing
Ping Li, Arnd Christian König

TL;DR
This paper develops a theoretical framework for b-bit minwise hashing, showing how storing only the lowest b bits of hashed values can significantly reduce storage and computational costs in set similarity estimation.
Contribution
It introduces a formal theoretical foundation for b-bit minwise hashing and provides an unbiased estimator for resemblance, demonstrating substantial storage savings.
Findings
Even in the least favorable scenario, b=1 can reduce storage by a factor of at least 21.3 compared to b=64 (or 10.7 compared to b=32) when the resemblance of interest is > 0.5
Theoretical results establish unbiased estimation of set resemblance with b-bit hashing
Significant efficiency gains in data retrieval and similarity estimation applications
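The storage factor quoted in the findings can be reproduced with a short back-of-the-envelope calculation. The sketch below is not taken from this summary. It assumes the sparse-data limit (both sets are tiny relative to the universe), in which two b-bit values collide by chance with probability 1/2^b. The symbols k (number of hash samples), R (resemblance), and P_1 (probability that the lowest bits match) are notation introduced here for illustration.

```latex
% Assumed sparse-data limit, b = 1: lowest bits match with probability
P_1 = \tfrac{1}{2} + \tfrac{1}{2} R
\quad\Longrightarrow\quad
\hat{R}_1 = 2\hat{P}_1 - 1,
\qquad
\operatorname{Var}(\hat{R}_1) = \frac{4 P_1 (1 - P_1)}{k} = \frac{1 - R^2}{k}.

% Original minwise hashing with full hashed values (e.g., 64 bits each):
\operatorname{Var}(\hat{R}_M) = \frac{R(1 - R)}{k}.

% Variance inflation from keeping a single bit:
\frac{\operatorname{Var}(\hat{R}_1)}{\operatorname{Var}(\hat{R}_M)}
  = \frac{1 + R}{R} \le 3 \quad \text{for } R \ge 0.5.

% At most 3x as many 1-bit samples match the accuracy of k 64-bit samples:
\frac{64\,k}{1 \cdot 3k} = \frac{64}{3} \approx 21.3,
\qquad
\frac{32\,k}{1 \cdot 3k} = \frac{32}{3} \approx 10.7.
```

Under these assumptions the variance ratio (1 + R)/R decreases as R grows, so the saving is smallest at R = 0.5 and larger for more similar sets.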
Abstract
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5.
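To make the idea in the abstract concrete, here is a minimal Python sketch of b-bit minwise hashing. It is an illustration under stated assumptions, not the authors' implementation. Salted BLAKE2b hashes stand in for random permutations, and the estimator uses the simplified sparse-data correction (chance collision probability 1/2^b) rather than the paper's general formula. The function names `hash64`, `b_bit_signature`, and `estimate_resemblance` are hypothetical.

```python
import hashlib


def hash64(seed: int, item: int) -> int:
    """64-bit hash of (seed, item); an assumed stand-in for a random permutation."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def b_bit_signature(items, k: int, b: int) -> list:
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    # The minimum is taken over the full 64-bit values; only then are bits dropped.
    return [min(hash64(j, x) for x in items) & mask for j in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Estimate resemblance from two b-bit signatures.

    Assumes the sparse-data limit, where unrelated b-bit values
    match by chance with probability 1 / 2**b.
    """
    matches = sum(u == v for u, v in zip(sig1, sig2))
    p_hat = matches / len(sig1)
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)


# Two sets with |intersection| = 100 and |union| = 200, so true resemblance is 0.5.
s1 = set(range(0, 150))
s2 = set(range(50, 200))
k, b = 500, 1
estimate = estimate_resemblance(b_bit_signature(s1, k, b), b_bit_signature(s2, k, b), b)
print(f"estimated resemblance with b={b}, k={k}: {estimate:.3f}")
```

Each signature here occupies k·b bits instead of k·64, which is the source of the storage saving. The price is a noisier estimate per sample, so in practice k would be chosen somewhat larger for small b.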
Taxonomy
Topics: Graph Labeling and Dimension Problems · graph theory and CDMA systems · Algorithms and Data Compression
