TransClean: Finding False Positives in Multi-Source Entity Matching under Real-World Conditions via Transitive Consistency
Fernando de Meer Pardo, Branka Hadji Misheva, Martin Braschler, Kurt Stockinger

TL;DR
TransClean detects false positives in multi-source entity matching by exploiting transitive consistency, improving matching accuracy with only limited manual labeling under noisy, real-world conditions.
Contribution
It introduces a transitive consistency-based approach for identifying false positives in multi-source entity matching, operating efficiently with limited manual labels and handling distributional shifts.
Findings
TransClean improves F1 score by an average of 24.42 points.
It effectively detects false positives across various datasets.
The method enhances existing matching models without retraining.
Abstract
We present TransClean, a method for detecting false positive predictions of entity matching algorithms under real-world conditions characterized by large-scale, noisy, and unlabeled multi-source datasets that undergo distributional shifts. TransClean is explicitly designed to operate with multiple data sources in an efficient, robust and fast manner while accounting for edge cases and requiring limited manual labeling. TransClean leverages the Transitive Consistency of a matching, a measure of the consistency of a pairwise matching model $f_\theta$ on the matching it produces, $G_{f_\theta}$, based both on its predictions on directly evaluated record pairs and its predictions on implied record pairs. TransClean iteratively modifies a matching through gradually removing false positive matches while removing as few true positive matches as possible. In each of these steps, the estimation of the…
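The idea of "implied record pairs" in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or its definition of Transitive Consistency: it only assumes that a matching can be viewed as a graph of predicted match edges, that two records in the same connected component without a direct edge form an implied pair, and that a hypothetical callable `predict_match(a, b)` stands in for the pairwise model. The names `implied_pairs` and `consistency_ratio` are likewise made up for illustration.

```python
from itertools import combinations


def connected_components(edges):
    """Group records into connected components with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def implied_pairs(edges):
    """Pairs linked only transitively: same component, no direct match edge."""
    direct = {frozenset(edge) for edge in edges}
    pairs = []
    for component in connected_components(edges):
        for a, b in combinations(sorted(component), 2):
            if frozenset((a, b)) not in direct:
                pairs.append((a, b))
    return pairs


def consistency_ratio(edges, predict_match):
    """Illustrative score (an assumption, not the paper's measure):
    the fraction of implied pairs that the model also predicts as matches."""
    pairs = implied_pairs(edges)
    if not pairs:
        return 1.0  # nothing is implied, so nothing can be contradicted
    agreed = sum(1 for a, b in pairs if predict_match(a, b))
    return agreed / len(pairs)


# Toy usage: "a"-"b" and "b"-"c" are predicted matches, so "a"-"c" is implied.
edges = [("a", "b"), ("b", "c")]
print(implied_pairs(edges))
print(consistency_ratio(edges, lambda a, b: False))
```

In this toy view, a component whose implied pairs are mostly rejected by the model is a natural place to look for a false positive edge. How TransClean actually scores and removes edges is described in the paper itself.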
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Time Series Analysis and Forecasting
