b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
Ping Li, Anshumali Shrivastava, Arnd Christian Konig

TL;DR
This paper demonstrates that b-bit minwise hashing, accelerated with GPU preprocessing and simple hash functions, effectively reduces data size and loading time for large-scale batch and online learning in search applications.
Contribution
It introduces a GPU-based parallelization scheme for preprocessing, shows that b-bit minwise hashing is effective for online learning, and demonstrates that simple hash functions yield results comparable to fully random permutations.
Findings
GPU parallelization reduces preprocessing time by a factor of 20-80.
b-bit minwise hashing significantly decreases data loading time.
Simple hash functions produce results comparable to fully random permutations.
Abstract
In this paper, we study several critical issues which must be tackled before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications, especially in the context of search. 1. (b-bit) Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20-80 and becomes substantially smaller than the data loading time. 2. One major advantage of b-bit minwise hashing is that it can substantially reduce the amount of memory required for batch learning. However, as online algorithms become increasingly popular for large-scale learning in the context of search, it is not clear if b-bit minwise yields significant improvements for them.…
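To make the preprocessing step concrete, here is a minimal CPU sketch in Python, not the paper's GPU implementation. It assumes each data vector is a set of non-negative integer feature indices, and it replaces full permutations with a simple linear hash of the form (a*x + c) mod p. The function names, the choice of prime, and the resemblance estimator (which uses the standard sparse-data approximation P(match) ≈ 2^-b + (1 - 2^-b) R) are illustrative assumptions rather than details taken from this page.

```python
import random

# Assumption: a Mersenne prime larger than any feature index in the data.
PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def bbit_minhash(features, params, b):
    """Return k values: the lowest b bits of each minimum hashed feature index."""
    mask = (1 << b) - 1
    signature = []
    for a, c in params:
        # Minimum of the hashed indices stands in for the minimum under a permutation.
        smallest = min((a * x + c) % PRIME for x in features)
        signature.append(smallest & mask)
    return signature


def estimate_resemblance(sig1, sig2, b):
    """Estimate resemblance R from the fraction of matching b-bit values.

    Uses the sparse-data approximation P(match) ~= 2^-b + (1 - 2^-b) * R,
    which is an assumption of this sketch.
    """
    match = sum(1 for u, v in zip(sig1, sig2) if u == v) / len(sig1)
    c = 1.0 / (1 << b)
    return (match - c) / (1 - c)
```

For example, calling `make_hash_params(500)` once and then `bbit_minhash(vector, params, 8)` per data vector stores each vector as 500 values of 8 bits each, which is the source of the memory savings the abstract describes. The inner loop over hash functions is independent across both hash functions and data vectors, which is what makes this step a natural fit for parallel hardware.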
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression · Caching and Content Delivery
