I/O-Efficient Similarity Join
Rasmus Pagh, Ninh Pham, Francesco Silvestri, Morten Stöckel

TL;DR
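This paper introduces an I/O-efficient, cache-oblivious algorithm for similarity joins based on locality-sensitive hashing (LSH). It has provable sub-quadratic dependency on the data size and makes better use of internal memory than straightforward external-memory implementations of known LSH-based algorithms.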
Contribution
The paper presents a novel I/O-efficient, recursive LSH-based similarity join algorithm that improves upon existing external memory algorithms by leveraging internal memory more effectively.
Findings
Achieves sub-quadratic dependency on data size
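Replaces the N^ρ factor in the time complexity of classical LSH-based algorithms with an (N/M)^ρ factor in the I/O complexity, where N is the data size, M is the internal memory size, and ρ is a parameter of the LSH used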
Outputs correct results with high probability
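To get a feel for the gap between the two factors named in the findings, the following minimal sketch evaluates N^ρ and (N/M)^ρ for one set of made-up sizes. The function name lsh_factors and the values of N, M, and ρ are illustrative assumptions, not figures from the paper, and these are only single factors, not the full complexity bounds.

```python
def lsh_factors(n, m, rho):
    """Return (n**rho, (n/m)**rho) for data size n, memory size m, LSH parameter rho.

    Illustrative helper (hypothetical name); it evaluates only the two
    factors mentioned in the findings, not a complete complexity bound.
    """
    if n <= 0 or m <= 0 or not 0 < rho <= 1:
        raise ValueError("expected n > 0, m > 0 and 0 < rho <= 1")
    return n ** rho, (n / m) ** rho


# Assumed sizes: about a billion items, about a million of which fit in memory.
classical, with_memory = lsh_factors(n=2**30, m=2**20, rho=0.5)
print(f"N^rho     = {classical:.0f}")
print(f"(N/M)^rho = {with_memory:.0f}")
print(f"ratio     = {classical / with_memory:.0f}")  # the ratio equals M^rho
```

With these assumed values the factors come out at about 32768 and 32, so the saving is a factor of M^ρ, which grows with the amount of internal memory available.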
Abstract
We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: Whereas the time complexity of classical algorithms includes a factor of N^ρ, where ρ is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor (N/M)^ρ, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will be useful also in other computational…
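The general idea of a recursive, randomized LSH-based join can be sketched in a few lines. The code below is a toy in-memory illustration only and is not the paper's procedure: it does not model I/O or caches, it assumes binary vectors under Hamming distance with bit-sampling LSH, and every name and parameter (recursive_lsh_join, similarity_join, base, max_depth, repetitions) is hypothetical. Each pass splits both inputs by one randomly chosen coordinate and recurses on matching buckets; independent repetitions raise the chance that a close pair is found.

```python
import random
from itertools import product


def hamming(a, b):
    """Number of coordinates where the equal-length tuples a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def recursive_lsh_join(R, S, r, rng, depth=0, max_depth=8, base=4):
    """One randomized pass: pairs (x, y) in R x S with hamming(x, y) <= r
    that stay in the same bucket all the way down the recursion."""
    if not R or not S:
        return set()
    # Small or deep subproblems are solved by an exact nested-loop join.
    if len(R) + len(S) <= base or depth == max_depth:
        return {(x, y) for x, y in product(R, S) if hamming(x, y) <= r}
    # Bit-sampling LSH (assumed here): hash a vector to one random coordinate.
    i = rng.randrange(len(R[0]))
    found = set()
    for bit in (0, 1):
        found |= recursive_lsh_join(
            [x for x in R if x[i] == bit],
            [y for y in S if y[i] == bit],
            r, rng, depth + 1, max_depth, base,
        )
    return found


def similarity_join(R, S, r, repetitions=20, seed=0):
    """Union of several independent passes; may still miss some close pairs."""
    rng = random.Random(seed)
    found = set()
    for _ in range(repetitions):
        found |= recursive_lsh_join(R, S, r, rng)
    return found


# Tiny usage example with made-up 4-bit vectors.
R = [(0, 0, 0, 0), (1, 1, 1, 1)]
S = [(0, 0, 0, 1), (1, 0, 0, 0)]
print(sorted(similarity_join(R, S, r=1)))
```

Because every reported pair is checked with an exact distance computation, this sketch never reports a false positive; the randomness only affects which true pairs are found, which is why the passes are repeated.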
Taxonomy
Topics: Algorithms and Data Compression · Advanced Image and Video Retrieval Techniques · Data Management and Algorithms
