Adaptive MapReduce Similarity Joins
Samuel McCauley, Francesco Silvestri

TL;DR
This paper introduces an adaptive MapReduce algorithm for similarity joins. It combines the strengths of earlier output-sensitive parallel methods with those of sequential methods that adapt to the structure of the data, giving bounds that depend on data density and output size.
Contribution
It adapts existing LSH-based similarity join algorithms to the parallel setting, achieving bounds that depend on data density and output size without extra parameters.
Findings
Achieves bounds depending on data density and output size
No extra parameters needed, simple modification of existing algorithms
Expected by the authors to be efficient in practice
Abstract
Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be adapted to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple…
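To make the problem statement concrete, here is a minimal single-machine sketch of a basic LSH-based similarity join, organized as a map step (hash points into buckets) and a reduce step (verify pairs within each bucket). It is not the paper's adaptive algorithm. It assumes points are equal-length bit tuples under Hamming distance and uses bit-sampling LSH; the function names and the parameters `k` and `repetitions` are illustrative choices, not taken from the paper.

```python
import random
from collections import defaultdict
from itertools import product


def hamming(x, y):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def brute_force_join(S, R, r):
    """Exact join: all index pairs (i, j) with hamming(S[i], R[j]) <= r."""
    return {(i, j)
            for i, x in enumerate(S)
            for j, y in enumerate(R)
            if hamming(x, y) <= r}


def lsh_join(S, R, r, k=4, repetitions=20, seed=0):
    """Approximate join using bit-sampling LSH (assumed hash family).

    Each repetition samples k coordinates, buckets points by their values
    on those coordinates, and verifies every S-R pair sharing a bucket.
    Reported pairs are always correct; close pairs may be missed.
    """
    rng = random.Random(seed)
    dim = len(S[0])
    result = set()
    for _ in range(repetitions):
        coords = [rng.randrange(dim) for _ in range(k)]
        # "Map": emit (bucket key, point id) for every point in S and R.
        buckets = defaultdict(lambda: ([], []))
        for i, x in enumerate(S):
            buckets[tuple(x[c] for c in coords)][0].append(i)
        for j, y in enumerate(R):
            buckets[tuple(y[c] for c in coords)][1].append(j)
        # "Reduce": within each bucket, check candidate pairs exactly.
        for s_ids, r_ids in buckets.values():
            for i, j in product(s_ids, r_ids):
                if hamming(S[i], R[j]) <= r:
                    result.add((i, j))
    return result
```

Because every candidate pair is verified against the true distance, `lsh_join` never reports a false positive: its output is always a subset of `brute_force_join`. The work done in the reduce step grows with bucket sizes, which is why dense regions of the data are costly for a fixed choice of `k`.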
