PASS-JOIN: A Partition-based Method for Similarity Joins
Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng

TL;DR
Pass-Join is a partition-based algorithm for string similarity joins with edit-distance constraints. It partitions strings into segments, builds inverted indices over the segments, and applies pruning techniques, which lets it handle both short and long strings efficiently and outperform state-of-the-art methods on real datasets.
Contribution
We introduce Pass-Join, a novel adaptive partition-based method that efficiently handles string similarity joins for both short and long strings, addressing limitations of prior algorithms.
Findings
Outperforms state-of-the-art methods on real datasets.
Efficiently supports both short and long strings.
Uses novel substring selection and pruning techniques.
Abstract
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and…
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Data Management and Algorithms
