TL;DR
LES3 introduces a learning-based exact set similarity search method that uses a lightweight bitmap-like index and optimized partitioning strategies to improve efficiency over traditional approaches.
Contribution
The paper proposes LES3, an exact set similarity search approach that combines learning-based partitioning (L2P, with the PTR data representation encoding) with a lightweight index, the token-group matrix (TGM), outperforming existing methods.
Findings
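LES3 is an exact method: it returns the same results as an exhaustive search, and its gains come from pruning candidate groups rather than from improved accuracy.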
LES3 reduces search time compared to traditional methods.
Experimental results validate the effectiveness of LES3.
Abstract
Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches on set similarity search utilize either heavy indexing structures, incurring large search costs, or indexes that produce large candidate sets. In this paper, we design a learning-based exact set similarity search approach, LES3. Our approach first partitions sets into groups, and then utilizes a light-weight bitmap-like indexing structure, called token-group matrix (TGM), to organize groups and prune out candidates given a query set. In order to optimize pruning using the TGM, we analytically investigate the optimal partitioning strategy under certain distributional assumptions. Using these results, we then design a learning-based partitioning approach called L2P and an associated data representation encoding, PTR, to identify the partitions. We…
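The group-level pruning described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes Jaccard similarity with a threshold greater than zero, uses a hand-picked partition in place of the learned L2P partitioning, and the names `build_tgm` and `search` are hypothetical. The bound used is a simple one: if only `count` query tokens appear anywhere in a group, no set in that group can share more than `count` tokens with the query, so its Jaccard similarity is at most `count / len(query)`.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

TokenSet = FrozenSet[str]


def jaccard(a: TokenSet, b: TokenSet) -> float:
    """Exact Jaccard similarity between two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def build_tgm(groups: List[List[TokenSet]]) -> Dict[str, Set[int]]:
    """Sparse token-group matrix: token -> ids of groups containing that token."""
    tgm: Dict[str, Set[int]] = {}
    for gid, group in enumerate(groups):
        for s in group:
            for tok in s:
                tgm.setdefault(tok, set()).add(gid)
    return tgm


def search(
    query: TokenSet,
    groups: List[List[TokenSet]],
    tgm: Dict[str, Set[int]],
    threshold: float,
) -> Tuple[List[Tuple[TokenSet, float]], int]:
    """Return (matches, number of sets verified). Assumes threshold > 0."""
    # Per group, count how many query tokens occur somewhere in the group.
    hits: Dict[int, int] = {}
    for tok in query:
        for gid in tgm.get(tok, ()):
            hits[gid] = hits.get(gid, 0) + 1

    matches: List[Tuple[TokenSet, float]] = []
    verified = 0
    for gid, count in hits.items():
        # Overlap with any set in the group is at most `count`, and the union
        # is at least len(query), so Jaccard <= count / len(query).
        if count / len(query) < threshold:
            continue  # the whole group is pruned without touching its sets
        for s in groups[gid]:
            verified += 1
            sim = jaccard(query, s)
            if sim >= threshold:
                matches.append((s, sim))
    return matches, verified


# Hand-made partition (an assumption; LES3 learns the partition instead).
groups = [
    [frozenset("abc"), frozenset("abd")],
    [frozenset("xyz"), frozenset("xyw")],
    [frozenset("axq")],
]
tgm = build_tgm(groups)
query = frozenset("abcd")
matches, verified = search(query, groups, tgm, threshold=0.6)
print(sorted("".join(sorted(s)) for s, _ in matches), verified)
```

In this toy data only the first group survives the bound, so two of the five sets are verified, and the result is still exact because pruned groups cannot contain a match. How much gets pruned depends on the partition, which is why the abstract treats the partitioning strategy as the thing to optimize.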
