Adaptive MapReduce Similarity Joins

Samuel McCauley; Francesco Silvestri

arXiv:1804.05615 · cs.DS · April 17, 2018

TL;DR

This paper introduces an adaptive MapReduce algorithm for similarity joins that combines the output-size adaptivity of earlier massively parallel methods with the data-structure adaptivity of earlier sequential methods, giving bounds that depend on data density and output size.

Contribution

It adapts existing LSH-based similarity join algorithms to the parallel setting, achieving bounds that depend on data density and output size without extra parameters.

Findings

01 Achieves bounds depending on data density and output size

02 No extra parameters needed, simple modification of existing algorithms

03 Likely to be efficient in practical applications

Abstract

Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all points x in S and y in R with distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity join, and in particular two recent lines of work have made exciting progress on LSH-based join performance. Hu, Tao, and Yi (PODS 17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA 17) showed a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving them significantly on more structured data. We show that this adaptive strategy can be adapted to the parallel setting, combining the advantages of these approaches. In particular, we show that a simple…
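To make the problem statement concrete, the following is a minimal sequential sketch of a similarity join, first by brute force and then with LSH used to generate candidate pairs. It is not the paper's MapReduce algorithm. The choice of Hamming distance over bit tuples, the bit-sampling hash family, and the names `brute_force_join`, `lsh_join`, `k`, and `repetitions` are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from itertools import product


def hamming(x, y):
    """Number of coordinates where two equal-length tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def brute_force_join(S, R, r):
    """All index pairs (i, j) with hamming(S[i], R[j]) <= r, checking every pair."""
    return {
        (i, j)
        for (i, x), (j, y) in product(enumerate(S), enumerate(R))
        if hamming(x, y) <= r
    }


def lsh_join(S, R, r, k=2, repetitions=20, seed=0):
    """LSH-based join sketch (illustrative only, not the paper's algorithm).

    Each repetition samples k coordinates (bit-sampling LSH for Hamming
    distance), buckets R by those coordinates, and verifies only the pairs
    that collide. Close pairs may be missed, but no far pair is reported.
    """
    if not S or not R:
        return set()
    rng = random.Random(seed)
    d = len(S[0])
    found = set()
    for _ in range(repetitions):
        coords = [rng.randrange(d) for _ in range(k)]
        buckets = defaultdict(list)
        for j, y in enumerate(R):
            buckets[tuple(y[c] for c in coords)].append(j)
        for i, x in enumerate(S):
            key = tuple(x[c] for c in coords)
            for j in buckets.get(key, []):
                # Verify each candidate so the output contains no false positives.
                if hamming(x, R[j]) <= r:
                    found.add((i, j))
    return found
```

Because every candidate is verified, the result of `lsh_join` is always a subset of the result of `brute_force_join`. How many close pairs it recovers depends on `k` and `repetitions`, which this sketch fixes by hand.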
