Pb-Hash: Partitioned b-bit Hashing

Ping Li; Weijie Zhao

arXiv:2306.15944·cs.LG·June 29, 2023

TL;DR

Pb-Hash introduces a partitioned hashing approach that reduces model size by dividing hash bits into chunks, with theoretical analysis and empirical validation showing minimal accuracy loss for small numbers of chunks.

Contribution

The paper proposes Pb-Hash, a partitioned hashing method that re-uses each hash value by splitting its bits into chunks. This shrinks the model size at the cost of a small accuracy loss, a trade-off supported by theoretical analysis and by experiments on machine learning models.

Findings

01

Model size can be significantly reduced with small accuracy loss.

02

Partitioning into 2-4 chunks maintains high accuracy.

03

Effective pooling strategies for combining embeddings are explored.

Abstract

Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B \times k$ bits; and when used for large-scale learning, the model size would be $2^{B} \times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size becomes $m \times 2^{b} \times k$, which can be substantially smaller than the original $2^{B} \times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B / m$ bits would not be as accurate…
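The partitioning step and the model-size arithmetic in the abstract can be illustrated with a minimal sketch. It assumes that $m$ divides $B$ evenly and that chunks are taken from the lowest bits upward; the function names `partition_hash`, `model_size_full`, and `model_size_pb` are hypothetical and are not taken from the paper.

```python
from typing import List


def partition_hash(h: int, B: int, m: int) -> List[int]:
    """Split a B-bit hash value into m chunks of b = B // m bits each.

    Assumption for this sketch: m divides B, and chunk 0 holds the lowest bits.
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    mask = (1 << b) - 1  # b low-order ones
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size_full(B: int, k: int) -> int:
    """Model size when all B bits of each of the k hashes are used: 2^B * k."""
    return (2 ** B) * k


def model_size_pb(B: int, m: int, k: int) -> int:
    """Model size with m chunks of b = B // m bits: m * 2^b * k."""
    b = B // m
    return m * (2 ** b) * k


# Example: one 16-bit hash value split into 2 chunks of 8 bits.
chunks = partition_hash(0xABCD, B=16, m=2)   # [0xCD, 0xAB]

# Model size for B = 16 bits and k = 128 hashes.
full = model_size_full(16, 128)              # 2^16 * 128
pb2 = model_size_pb(16, 2, 128)              # 2 * 2^8 * 128
pb4 = model_size_pb(16, 4, 128)              # 4 * 2^4 * 128
```

With these illustrative values, the full model has 8,388,608 entries, while two chunks need 65,536 and four chunks need 8,192. The sketch covers only the size reduction; it says nothing about the accuracy loss that the abstract attributes to partitioning.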

Taxonomy

Topics: Spam and Phishing Detection · Caching and Content Delivery · Algorithms and Data Compression

Methods: Support Vector Machine