ApproxJoin: Approximate Matching for Efficient Verification in Fuzzy Set Similarity Join

Michael Mandulak; S M Ferdous; Sayan Ghosh; Mahantesh Halappanavar; George Slota

arXiv:2507.18891·cs.DB·July 28, 2025

Open Access

TL;DR

ApproxJoin introduces approximate maximum weight matching algorithms to significantly improve the efficiency of fuzzy set similarity join verification while maintaining high accuracy, outperforming state-of-the-art exact methods.

Contribution

ApproxJoin is the first work to apply approximate matching algorithms to verification in fuzzy set similarity joins, achieving substantial performance gains.

Findings

1. Performance improvements of 2-19x over state-of-the-art methods.
2. High accuracy with 99% recall maintained.
3. Evaluation of three approximate matching algorithms.

Abstract

The set similarity join problem is a fundamental problem in data processing and discovery, relying on exact similarity measures between sets. In the presence of alterations, such as misspellings on string data, the fuzzy set similarity join problem instead approximately matches pairs of elements based on the maximum weighted matching of the bipartite graph representation of sets. State-of-the-art methods within this domain improve performance through efficient filtering methods within the filter-verify framework, primarily to offset high verification costs induced by the usage of the Hungarian algorithm - an optimal matching method. Instead, we directly target the verification process to assess the efficacy of more efficient matching methods within candidate pair pruning. We present ApproxJoin, the first work of its kind in applying approximate maximum weight matching algorithms for…
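
To make the verification step described in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes token similarity is Jaccard over character 2-grams and that set similarity is the common fuzzy Jaccard form, weight / (|R| + |S| - weight); the abstract specifies neither. The exact matcher is a brute-force stand-in for the Hungarian algorithm that is only usable on tiny sets, and the greedy matcher is the classic 1/2-approximation, which may or may not be one of the three algorithms the paper evaluates. All function names are hypothetical.

```python
from itertools import permutations


def bigrams(token):
    """Character 2-grams of a token (the token itself if it is shorter than 2)."""
    return {token[i:i + 2] for i in range(len(token) - 1)} or {token}


def token_sim(a, b):
    """Assumed element similarity: Jaccard over character 2-gram sets, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb)


def exact_matching_weight(r, s):
    """Optimal bipartite matching weight by brute force.

    Stand-in for the Hungarian algorithm: same result, factorial cost,
    so only suitable for illustrating very small sets.
    """
    small, large = sorted((list(r), list(s)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        best = max(best, sum(token_sim(a, b) for a, b in zip(small, perm)))
    return best


def greedy_matching_weight(r, s):
    """Greedy 1/2-approximate matching: scan edges from heaviest to lightest
    and keep an edge whenever both of its endpoints are still unmatched."""
    edges = sorted(((token_sim(a, b), a, b) for a in r for b in s), reverse=True)
    used_r, used_s, total = set(), set(), 0.0
    for w, a, b in edges:
        if w > 0 and a not in used_r and b not in used_s:
            used_r.add(a)
            used_s.add(b)
            total += w
    return total


def fuzzy_jaccard(weight, r, s):
    """Assumed set-level similarity derived from a matching weight."""
    return weight / (len(r) + len(s) - weight)


if __name__ == "__main__":
    r = ["approximate", "matching", "join"]
    s = ["aproximate", "matchin", "joins"]
    exact = exact_matching_weight(r, s)
    greedy = greedy_matching_weight(r, s)
    print("exact :", exact, fuzzy_jaccard(exact, r, s))
    print("greedy:", greedy, fuzzy_jaccard(greedy, r, s))
```

Because any valid matching weighs no more than the optimal one, the greedy score is a lower bound on the exact score. A candidate pair whose greedy score already meets the join threshold can therefore be accepted without running the exact matcher; pairs that fall short under the approximation are where recall can be lost.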

Taxonomy

Topics: Data Quality and Management · Graph Theory and Algorithms · Data Management and Algorithms