Pb-Hash: Partitioned b-bit Hashing
Ping Li, Weijie Zhao

TL;DR
Pb-Hash introduces a partitioned hashing approach that reduces model size by dividing hash bits into chunks, with theoretical analysis and empirical validation showing minimal accuracy loss for small numbers of chunks.
Contribution
The paper proposes Pb-Hash, a novel partitioned hashing method that decreases model size while maintaining accuracy, supported by theoretical analysis and experiments on machine learning models.
Findings
Model size can be significantly reduced with small accuracy loss.
Partitioning into 2-4 chunks maintains high accuracy.
Effective pooling strategies for combining embeddings are explored.
Abstract
Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B \times k$ bits; and when used for large-scale learning, the model size would be $2^B \times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size becomes $m \times 2^b \times k$, which can be substantially smaller than the original $2^B \times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits would not be as accurate…
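The partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`split_into_chunks`, `model_size_full`, `model_size_partitioned`) are hypothetical, and so is the choice to take chunks starting from the lowest bits. Model size is counted here as the number of weights, assuming one weight per possible hash value per hash, with B the bits per hash, k the number of hashes, m the number of chunks, and b = B / m the bits per chunk.

```python
from typing import List


def split_into_chunks(h: int, B: int, m: int) -> List[int]:
    """Split the lowest B bits of hash value h into m chunks of b = B // m bits.

    Chunk 0 holds the lowest b bits (an ordering assumed for this sketch).
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B evenly")
    b = B // m
    mask = (1 << b) - 1  # b ones in binary
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size_full(B: int, k: int) -> int:
    """Number of weights when each of k hashes is used as a full B-bit value."""
    return (2 ** B) * k


def model_size_partitioned(B: int, m: int, k: int) -> int:
    """Number of weights when each of k hashes is split into m chunks of B // m bits."""
    b = B // m
    return m * (2 ** b) * k


# Hypothetical example values: a 16-bit hash, k = 100 hashes.
chunks_m2 = split_into_chunks(0xABCD, B=16, m=2)   # [0xCD, 0xAB]
chunks_m4 = split_into_chunks(0xABCD, B=16, m=4)   # [0xD, 0xC, 0xB, 0xA]
full = model_size_full(B=16, k=100)                # 6,553,600 weights
part_m2 = model_size_partitioned(B=16, m=2, k=100) # 51,200 weights
part_m4 = model_size_partitioned(B=16, m=4, k=100) # 6,400 weights
```

With these illustrative numbers, two chunks shrink the weight count by a factor of 128 and four chunks by a factor of 1024. The accuracy cost of that reduction is what the paper's theoretical analysis addresses, and this sketch does not model it.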
Taxonomy
Topics: Spam and Phishing Detection · Caching and Content Delivery · Algorithms and Data Compression
Methods: Support Vector Machine
