Hashing Algorithms for Large-Scale Learning
Ping Li, Anshumali Shrivastava, Joshua Moore, Arnd Christian Konig

TL;DR
This paper integrates b-bit minwise hashing with linear learning algorithms such as SVM and logistic regression, enabling efficient learning on massive, high-dimensional datasets, and shows that it is more accurate than Vowpal Wabbit (VW) at the same storage.
Contribution
The paper presents a novel integration of b-bit minwise hashing with linear learning algorithms, improving efficiency and accuracy in large-scale, high-dimensional data scenarios.
Findings
b-bit minwise hashing can be integrated with SVM and logistic regression.
Compared to VW, b-bit minwise hashing is more accurate at the same storage level.
Combining b-bit minwise hashing with VW enhances training speed, especially with larger b.
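For readers unfamiliar with the VW side of the comparison, the following is a minimal sketch of a signed feature-hashing scheme of the kind VW is known for. It is an illustration under stated assumptions, not the paper's or VW's actual code: the function name `signed_hash_features`, the explicit lookup tables for the bucket and sign of each index, and the toy dimension `D` are all hypothetical.

```python
import random

def signed_hash_features(x, D, m, seed=0):
    """Hash a sparse vector x (dict: index -> value, indices in [0, D)) into m bins.

    Assumption: the bucket and sign for every index are stored in explicit
    tables, which is only practical for a toy D; a real system would compute
    them with a hash function instead.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(m) for _ in range(D)]    # h(j): index -> bin
    sign = [rng.choice((-1, 1)) for _ in range(D)]   # r(j): random +/-1 per index
    g = [0.0] * m
    for j, v in x.items():
        g[bucket[j]] += sign[j] * v                  # colliding indices add up
    return g

x = {3: 1.0, 17: 2.0, 250: -1.5}
g = signed_hash_features(x, D=1000, m=16)
```

Because the same seed yields the same bucket and sign tables, two vectors hashed with the same seed land in a shared m-dimensional space, and the random signs make the inner product of the hashed vectors an unbiased estimate of the original inner product.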
Abstract
In this paper, we first demonstrate that b-bit minwise hashing, whose estimators are positive definite kernels, can be naturally integrated with learning algorithms such as SVM and logistic regression. We adopt a simple scheme to transform the nonlinear (resemblance) kernel into a linear (inner product) kernel; hence large-scale problems can be solved extremely efficiently. Our method provides a simple, effective solution to large-scale learning in massive and extremely high-dimensional datasets, especially when data do not fit in memory. We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm (which is related to the Count-Min (CM) sketch). Interestingly, VW has the same variances as random projections. Our theoretical and empirical comparisons illustrate that usually b-bit minwise hashing is significantly more accurate (at the same storage) than VW (and random…
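The transformation from resemblance kernel to inner product described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `bbit_minhash_features` and `dot` are hypothetical, explicit random permutations are used (practical only for a toy dimension `D`), and every input set is assumed non-empty. For each of k permutations, the minimum permuted index of a set is reduced to its lowest b bits and one-hot encoded into a block of length 2^b, so each example becomes a binary vector of length k · 2^b with exactly k ones.

```python
import random

def bbit_minhash_features(index_sets, D, k=64, b=2, seed=0):
    """Expand sets of feature indices in [0, D) into 0/1 vectors of length k * 2**b."""
    rng = random.Random(seed)
    width = 1 << b                        # 2**b slots per permutation
    perms = []
    for _ in range(k):                    # k explicit permutations (toy D only)
        p = list(range(D))
        rng.shuffle(p)
        perms.append(p)
    rows = []
    for s in index_sets:                  # each s is assumed non-empty
        x = [0] * (k * width)
        for j, p in enumerate(perms):
            z = min(p[i] for i in s)      # minwise hash under permutation j
            low = z & (width - 1)         # keep only the lowest b bits
            x[j * width + low] = 1        # one-hot encode inside block j
        rows.append(x)
    return rows

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

sets = [{1, 5, 9, 200}, {1, 5, 9, 300}, {3, 7, 800}]
X = bbit_minhash_features(sets, D=1000, k=64, b=2)
match_fraction = dot(X[0], X[1]) / 64     # fraction of permutations whose b bits agree
```

The inner product of two expanded vectors counts the permutations on which the lowest b bits agree, so a linear SVM or logistic regression trained on these vectors is implicitly working with the b-bit matching estimate, which increases with the resemblance of the underlying sets.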
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression · Spam and Phishing Detection
