Maintaining $k$-MinHash Signatures over Fully-Dynamic Data Streams with Recovery
Andrea Clementi, Luciano Gualà, Luca Pepè Sciarria, Alessandro Straziota

TL;DR
This paper introduces a buffered, dynamic version of the $k$-MinHash sketch that efficiently manages updates in fully-dynamic data streams, enabling fast similarity queries with reduced rebuilds and maintaining high accuracy.
Contribution
It presents a novel buffered $k$-MinHash data structure that improves update efficiency and reduces costly rebuilds in dynamic data streams.
Findings
Uses $O(k \, \log |U|)$ memory per subset.
Achieves $O(k \, \log |U|)$ amortized update time.
Returns exact $k$-MinHash signatures in $O(k)$ time.
Abstract
We consider the task of performing Jaccard similarity queries over a large collection of items that are dynamically updated according to a streaming input model. An item here is a subset of a large universe of elements. A well-studied approach to address this important problem in data mining is to design fast-similarity data sketches. In this paper, we focus on global solutions for this problem, i.e., a single data structure which is able to answer both Similarity Estimation and All-Candidate Pairs queries, while also dynamically managing an arbitrary, online sequence of element insertions and deletions received in input. We introduce and provide an in-depth analysis of a dynamic, buffered version of the well-known $k$-MinHash sketch. This buffered version better manages critical update operations thus significantly reducing the number of times the sketch needs to be rebuilt from…
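For readers unfamiliar with the underlying sketch, the following is a minimal Python illustration of a plain (static, unbuffered) $k$-MinHash signature and the Jaccard estimate derived from it. It is not the paper's buffered data structure. It assumes the common formulation with $k$ independent hash functions, integer element identifiers, and simple linear hashes modulo a prime; the names `make_hash_functions`, `signature`, and `estimate_jaccard` are hypothetical and chosen only for this sketch.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; assumed larger than any element id


def make_hash_functions(k, seed=0):
    """Return k hash functions h(x) = (a*x + b) mod PRIME (illustrative choice)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def signature(items, hash_functions):
    """k-MinHash signature: the minimum hash value of the set under each function."""
    return [min(h(x) for x in items) for h in hash_functions]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of agreeing coordinates."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    hs = make_hash_functions(k=128)
    a = set(range(0, 100))
    b = set(range(50, 150))  # true Jaccard similarity is 50/150
    print(estimate_jaccard(signature(a, hs), signature(b, hs)))
```

In this static form, inserting an element only requires taking a coordinate-wise minimum, whereas deleting an element that currently attains a minimum forces that coordinate to be recomputed from the whole set. That deletion case is the kind of critical update the abstract says the buffered version is designed to manage.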
Taxonomy
Topics: Data Quality and Management · Cryptography and Data Security · Privacy-Preserving Technologies in Data
