b-Bit Minwise Hashing

Ping Li; Arnd Christian Konig

arXiv:0910.3349·cs.DS·October 20, 2009·4 cites

TL;DR

This paper develops a theoretical framework for b-bit minwise hashing, showing how storing only the lowest b bits of hashed values can significantly reduce storage and computational costs in set similarity estimation.

Contribution

The paper gives b-bit minwise hashing a formal theoretical foundation, provides an unbiased estimator of the resemblance for any b, and quantifies the resulting storage savings.

Findings

01

Even in the least favorable scenario, b=1 may reduce storage by a factor of at least 21.3 compared to b=64 (or 10.7 compared to b=32) when the resemblance of interest is > 0.5

02

Theoretical results establish unbiased estimation of set resemblance with b-bit hashing

03

Significant efficiency gains in data retrieval and similarity estimation applications
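The storage figures in the first finding can be reproduced with a short calculation. The sketch below is an illustration, not the paper's own derivation. It assumes the sparse-data limit (sets much smaller than the universe), where two b-bit values collide with probability P_b = 2^-b + (1 - 2^-b) R, and it compares bits stored per sample multiplied by the estimator's variance. The function names are hypothetical.

```python
def variance_factor_minwise(R):
    """k * Var of the classic minwise estimator: R(1 - R)."""
    return R * (1.0 - R)


def variance_factor_b_bit(R, b):
    """k * Var of the b-bit estimator, under the ASSUMED sparse-data limit.

    Collision probability: P_b = 2^-b + (1 - 2^-b) * R
    Estimator:             R_hat = (P_hat - 2^-b) / (1 - 2^-b)
    """
    c = 2.0 ** -b
    p = c + (1.0 - c) * R
    return p * (1.0 - p) / (1.0 - c) ** 2


def storage_improvement(R, b, full_bits=64):
    """Ratio of (bits per sample * variance) for full hashes vs. b-bit hashes."""
    full_cost = full_bits * variance_factor_minwise(R)
    b_bit_cost = b * variance_factor_b_bit(R, b)
    return full_cost / b_bit_cost
```

Under these assumptions, `storage_improvement(0.5, 1, 64)` evaluates to 64/3 (about 21.3) and `storage_improvement(0.5, 1, 32)` to 32/3 (about 10.7), matching the factors quoted in the abstract. For b=1 the ratio simplifies to full_bits * R / (1 + R), which grows with R, so the values at R = 0.5 act as lower bounds for resemblance > 0.5.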

Abstract

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest $b$ bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5.
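The procedure described in the abstract (compute k minwise hashes, keep only the lowest b bits of each, then correct for accidental collisions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: salted BLAKE2b hashes stand in for random permutations, the collision correction uses the sparse-data constant 2^-b rather than the paper's general constants, and all function names are hypothetical.

```python
import hashlib


def _hash64(item, seed):
    """64-bit hash of `item`; the salt selects one of k hash functions.

    ASSUMPTION: a salted cryptographic hash approximates a random permutation.
    """
    salt = seed.to_bytes(8, "little")
    data = str(item).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def b_bit_signature(items, k, b):
    """For each of k hash functions, keep the lowest b bits of the minimum hash."""
    items = list(items)  # must be non-empty, otherwise min() raises ValueError
    mask = (1 << b) - 1
    return [min(_hash64(x, j) for x in items) & mask for j in range(k)]


def estimate_resemblance(sig1, sig2, b):
    """Estimate resemblance from two b-bit signatures of equal length.

    ASSUMPTION: sparse data, so unrelated b-bit values match with
    probability 2^-b, which is subtracted out and rescaled.
    """
    k = len(sig1)
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / k
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)
```

For example, with `s1 = b_bit_signature(range(0, 200), 1000, 2)` and `s2 = b_bit_signature(range(100, 300), 1000, 2)`, the call `estimate_resemblance(s1, s2, 2)` should land near the true resemblance 100/300, with sampling error that shrinks as k grows. Each signature here occupies 2 bits per sample instead of 64.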

Taxonomy

Topics: Graph Labeling and Dimension Problems · graph theory and CDMA systems · Algorithms and Data Compression