Set Similarity Search for Skewed Data
Samuel McCauley, Jesper W. Mikkelsen, Rasmus Pagh

TL;DR
This paper introduces a new data-dependent indexing method for set similarity search that effectively exploits skewed data distributions, providing stronger theoretical guarantees than previous worst-case approaches.
Contribution
The paper presents a novel analysis of set similarity search under a realistic random data model, demonstrating how skewed data can be leveraged for improved indexing performance.
Findings
The proposed method outperforms traditional heuristics on skewed data.
Theoretical analysis shows stronger guarantees under the random data model.
The approach adapts to data distribution, enhancing search accuracy.
Abstract
Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alpha-correlated…
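To make the set view of sparse 0-1 vectors concrete, the following is a minimal sketch of how the Pearson correlation of two such vectors can be computed from set sizes alone, together with a brute-force linear scan for a vector that is at least alpha-correlated with a query. This is only an illustrative baseline, not the indexing method the paper proposes. The function names (pearson_from_sets, brute_force_search) are hypothetical, and the sketch assumes the standard sample Pearson correlation over the d coordinates; the paper's exact definition of alpha-correlation is not given in the excerpt above and may differ.

```python
from math import sqrt


def pearson_from_sets(x, y, d):
    """Sample Pearson correlation of two 0-1 vectors of dimension d,
    each given as the set of its nonzero coordinates.

    With a = |x|, b = |y|, c = |x & y|, the correlation is
    (d*c - a*b) / sqrt(a*(d-a) * b*(d-b)).
    """
    a, b = len(x), len(y)
    c = len(x & y)  # dot product of the 0-1 vectors = intersection size
    denom = sqrt(a * (d - a) * b * (d - b))
    if denom == 0:
        # Assumption: a constant vector (all zeros or all ones) has
        # undefined correlation; this sketch returns 0.0 by convention.
        return 0.0
    return (d * c - a * b) / denom


def brute_force_search(S, y, alpha, d):
    """Linear-scan baseline: return the first set in S whose correlation
    with the query y is at least alpha, or None if there is none."""
    for x in S:
        if pearson_from_sets(x, y, d) >= alpha:
            return x
    return None


# Hypothetical toy data: three sets over d = 10 coordinates.
S = [{5, 6}, {0, 1, 3}, {7, 8, 9}]
match = brute_force_search(S, {0, 1, 2}, alpha=0.5, d=10)
```

A linear scan like this takes time proportional to the total size of S per query, which is the cost an index for set similarity search aims to beat.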
