Similarity Join Size Estimation using Locality Sensitive Hashing
Hongrae Lee (University of British Columbia), Raymond T. Ng (University of British Columbia), Kyuseok Shim (Seoul National University)

TL;DR
This paper introduces a sampling-based size estimation method for vector similarity joins using Locality Sensitive Hashing, improving accuracy over existing techniques especially at high similarity thresholds.
Contribution
It extends set similarity join size estimation to vector spaces with a novel LSH-based sampling algorithm that improves accuracy and reduces the variance of the estimates.
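To make the idea of LSH-based sampling concrete, here is a minimal Python sketch of a stratified estimator. It is an illustration under stated assumptions, not the paper's LSH-SS algorithm: it assumes cosine similarity, uses random-hyperplane signatures as the LSH function, and splits all pairs into two strata (pairs sharing a bucket, and pairs in different buckets) that are sampled separately. The names `stratified_join_size`, `pair_hits`, `num_bits`, and `num_samples` are hypothetical.

```python
import numpy as np

def pair_hits(unit, a, b, tau):
    """Fraction of index pairs (a[k], b[k]) whose cosine similarity is >= tau."""
    if len(a) == 0:
        return 0.0
    sims = np.einsum("ij,ij->i", unit[a], unit[b])
    return float(np.mean(sims >= tau))

def stratified_join_size(vectors, tau, num_bits, num_samples, rng):
    """Estimate the number of unordered pairs with cosine similarity >= tau."""
    n, dim = vectors.shape
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    # Assumed LSH function: random-hyperplane signatures packed into integer keys.
    planes = rng.standard_normal((dim, num_bits))
    keys = ((unit @ planes) >= 0) @ (1 << np.arange(num_bits))

    # Group vector indices by bucket and count the pairs inside each bucket.
    order = np.argsort(keys, kind="stable")
    _, starts, counts = np.unique(keys[order], return_index=True, return_counts=True)
    bucket_pairs = counts * (counts - 1) // 2
    n_high = int(bucket_pairs.sum())       # pairs that share a bucket
    n_low = n * (n - 1) // 2 - n_high      # pairs in different buckets

    estimate = 0.0
    if n_high > 0:
        # Pick a bucket with probability proportional to its pair count,
        # then a uniform pair of distinct members inside that bucket.
        chosen = rng.choice(len(counts), size=num_samples, p=bucket_pairs / n_high)
        c = counts[chosen]
        i = rng.integers(0, c)
        j = rng.integers(0, c - 1)
        j = j + (j >= i)                   # shift past i so that i != j
        a = order[starts[chosen] + i]
        b = order[starts[chosen] + j]
        estimate += n_high * pair_hits(unit, a, b, tau)
    if n_low > 0:
        # Draw uniform distinct pairs and keep only those in different buckets.
        a = rng.integers(0, n, size=num_samples)
        b = rng.integers(0, n - 1, size=num_samples)
        b = b + (b >= a)
        keep = keys[a] != keys[b]
        estimate += n_low * pair_hits(unit, a[keep], b[keep], tau)
    return estimate
```

A call such as `stratified_join_size(X, 0.9, num_bits=8, num_samples=2000, rng=np.random.default_rng(0))` returns a single estimate. The intuition behind the split is that highly similar pairs tend to collide in the same bucket, so sampling that stratum directly is more likely to find qualifying pairs at high thresholds than sampling uniformly over all pairs.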
Findings
LSH-SS outperforms random sampling in accuracy.
LSH-SS provides stable estimates at various thresholds.
The method is validated on real-world datasets.
Abstract
Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling based algorithm that uses the Locality-Sensitive-Hashing (LSH) scheme. The proposed algorithm LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and the state-of-the-art technique for SSJ (adapted to VSJ) and demonstrate LSH-SS offers more accurate estimates at both high and low similarity thresholds and small…
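As a point of reference for the problem statement, the following Python sketch shows the exact join size and the plain random-sampling baseline that the abstract compares against. It is a simplified illustration, assuming cosine similarity over dense vectors and a self-join over unordered pairs; the function names `exact_join_size` and `sampled_join_size` are hypothetical and not taken from the paper.

```python
import numpy as np

def exact_join_size(vectors, tau):
    """Count unordered pairs (i < j) whose cosine similarity is >= tau."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    i, j = np.triu_indices(len(vectors), k=1)
    return int(np.sum(sims[i, j] >= tau))

def sampled_join_size(vectors, tau, num_samples, rng):
    """Estimate the join size from uniformly sampled pairs of distinct vectors."""
    n = len(vectors)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    total_pairs = n * (n - 1) // 2

    # Uniform pairs with a != b: draw b from n - 1 values and shift past a.
    a = rng.integers(0, n, size=num_samples)
    b = rng.integers(0, n - 1, size=num_samples)
    b = b + (b >= a)

    sims = np.einsum("ij,ij->i", unit[a], unit[b])
    hits = np.sum(sims >= tau)

    # Scale the observed hit rate up to the full set of pairs.
    return total_pairs * hits / num_samples
```

At a high threshold the qualifying pairs are a tiny fraction of all pairs, so a uniform sample often contains no hits at all and the estimate collapses to zero. This is the sensitivity to the threshold that the abstract describes as a key challenge.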
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Data Management and Algorithms · Algorithms and Data Compression
