Efficient Similarity Search in Dynamic Data Streams
Marc Bury, Chris Schwiegelshohn, Mara Sorella

TL;DR
This paper develops a space-efficient, dynamic data stream algorithm for approximating rational set similarities, including Jaccard, and introduces the first locality sensitive hashing scheme capable of handling deletions.
Contribution
It presents the first LSH scheme for rational set similarities that is maintainable in dynamic data streams with deletions, along with approximation guarantees.
Findings
Provides a space-efficient $(1 \pm \varepsilon)$ approximation method.
Designs a novel LSH scheme for dynamic data streams.
Enables similarity search with deletions in large datasets.
Abstract
The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs due to both time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index of two item sets is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015] who showed any rational set similarity admits a locality sensitive hashing (LSH) scheme if and only…
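For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of classical min-hashing on an insertion-only stream. It is not the algorithm of this paper. The names `MinHashSketch`, `estimate_jaccard`, and `exact_jaccard` are hypothetical, and the use of salted BLAKE2b as a stand-in for a random hash family is an assumption made for illustration. The sketch relies on the standard fact that, for a random hash function, the minimum hash values of two sets $A$ and $B$ coincide with probability $|A \cap B| / |A \cup B|$.

```python
import hashlib


def _hash(item: int, index: int) -> int:
    """Hash a non-negative integer item (< 2**64) with the index-th function.

    Assumption: salted BLAKE2b stands in for a family of random hash functions.
    """
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(
        item.to_bytes(8, "little"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


class MinHashSketch:
    """Insertion-only min-hash sketch: one running minimum per hash function."""

    def __init__(self, num_hashes: int = 200) -> None:
        self.mins = [None] * num_hashes

    def add(self, item: int) -> None:
        # Keep only the smallest hash value seen so far for each function.
        for i in range(len(self.mins)):
            value = _hash(item, i)
            if self.mins[i] is None or value < self.mins[i]:
                self.mins[i] = value

    # There is deliberately no delete(): once the item attaining a minimum
    # is removed, the sketch has no record of the next-smallest hash value.


def estimate_jaccard(s: MinHashSketch, t: MinHashSketch) -> float:
    """Fraction of hash functions on which the two sketches agree."""
    if len(s.mins) != len(t.mins):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(
        1 for a, b in zip(s.mins, t.mins) if a is not None and a == b
    )
    return matches / len(s.mins)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B|, for comparison."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Feeding two overlapping item sets into two sketches and calling `estimate_jaccard` gives an estimate whose error shrinks as the number of hash functions grows. The missing `delete` operation is the point of the illustration: a sketch that stores only running minima cannot undo an insertion, which is the limitation the abstract describes for streams with item-wise deletions.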
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression · Caching and Content Delivery
