Scalable and robust set similarity join
Tobias Christiani, Rasmus Pagh, Johan Sivertsen

TL;DR
This paper introduces a new randomized algorithm for set similarity join that can achieve any desired recall up to 100% and is significantly faster than existing exact and approximate methods, especially on data sets that lack rare tokens.
Contribution
The paper presents a novel randomized algorithm for set similarity join that is robust and scalable, improves on existing methods both theoretically and empirically, and builds on recent high-dimensional sketching techniques.
Findings
At 90% recall, the algorithm is often more than ten times faster than state-of-the-art exact methods.
It is several times faster than comparable approximate methods.
Unlike prefix-filtering methods, its performance does not depend on the data set containing many rare tokens.
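The recall figures above compare the pairs an approximate join reports against the pairs an exact join would return. A minimal sketch of that measurement follows, assuming both results are available as collections of index pairs. The function name `recall` and the pair representation are illustrative assumptions, not taken from the paper.

```python
def recall(reported_pairs, exact_pairs):
    """Fraction of the exact join result that also appears in the reported result.

    Pairs are treated as unordered, so (3, 2) and (2, 3) count as the same pair.
    """
    exact = {tuple(sorted(p)) for p in exact_pairs}
    reported = {tuple(sorted(p)) for p in reported_pairs}
    if not exact:
        # Convention assumed here: an empty exact result is fully recalled.
        return 1.0
    return len(reported & exact) / len(exact)


# Hypothetical example: the exact join has four pairs, two of which are reported.
exact = [(1, 0), (2, 3), (4, 5), (6, 7)]
reported = [(0, 1), (3, 2)]
print(recall(reported, exact))
```

A join run "at 90% recall" would, under this definition, report at least nine out of every ten pairs in the exact result.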
Abstract
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important --- indeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often…
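To make the exact setting and the prefix-filtering baseline concrete, here is a minimal sketch of a Jaccard self-join done two ways: by brute force, and with a textbook prefix filter that orders tokens from rarest to most frequent. This illustrates the baseline the abstract refers to, not the authors' randomized algorithm. All function names are hypothetical, a pair is assumed to qualify when its similarity is at least the threshold, the threshold is assumed to lie in (0, 1], and tokens are assumed to be mutually comparable so ties can be broken deterministically.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def brute_force_join(sets, threshold):
    """Exact self-join: compare every pair of sets (quadratic time)."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    }


def prefix_filter_join(sets, threshold):
    """Exact self-join that only verifies pairs sharing a token in their prefixes."""
    # Global token order: rarest first, ties broken by the token itself.
    freq = Counter(tok for s in sets for tok in s)
    index = defaultdict(list)  # token -> ids of earlier sets with it in their prefix
    result = set()
    for i, s in enumerate(sets):
        ordered = sorted(s, key=lambda tok: (freq[tok], tok))
        # Small epsilon guards against float error such as 0.6 * 5 > 3.
        min_overlap = ceil(threshold * len(ordered) - 1e-9)
        prefix = ordered[: len(ordered) - min_overlap + 1]
        candidates = {j for tok in prefix for j in index[tok]}
        for j in candidates:
            if jaccard(sets[j], s) >= threshold:  # verification step
                result.add((j, i))
        for tok in prefix:
            index[tok].append(i)
    return result


# Hypothetical toy data.
sets = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y", "z"},
    {"x", "y", "z", "w"},
    {"a", "x"},
]
print(sorted(prefix_filter_join(sets, 0.5)))
```

The filter prunes well only when prefixes consist of rare tokens, so that few sets share a prefix token. When every token is common, the inverted lists grow long and the candidate set approaches all pairs, which is the dependence on rare tokens that the abstract describes.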
