Scaling Multiple-Source Entity Resolution using Statistically Efficient Transfer Learning
Sahand Negahban, Benjamin I. P. Rubinstein, Jim Gemmell

TL;DR
This paper introduces a transfer learning method for multi-source entity resolution. With heterogeneous sources, state-of-the-art techniques need labeled training data that grows quadratically in the number of sources; the proposed method holds precision/recall constant while labeling cost grows only linearly.
Contribution
The paper presents a novel transfer learning algorithm that shares structure across sources, reducing labeling requirements in multi-source entity resolution without adding runtime overhead.
Findings
Maintains constant precision/recall while labeling cost grows only linearly in the number of sources
Requires less training data than state-of-the-art methods
Maintains accuracy without runtime overhead
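To make the scaling claim in the findings concrete, here is a minimal arithmetic sketch. It compares a labeling budget that grows with the number of source pairs (quadratic) against one that grows with the number of sources (linear). The figure of 100 labels and both function names are hypothetical, chosen only for illustration; this is not the paper's algorithm, and the summary does not state the paper's actual constants.

```python
from math import comb


def pairwise_label_budget(num_sources: int, labels_per_pair: int) -> int:
    """Labels needed when every unordered pair of sources gets its own training set."""
    # There are C(n, 2) = n * (n - 1) / 2 unordered source pairs.
    return comb(num_sources, 2) * labels_per_pair


def per_source_label_budget(num_sources: int, labels_per_source: int) -> int:
    """Labels needed under a hypothetical budget that grows linearly with sources."""
    return num_sources * labels_per_source


if __name__ == "__main__":
    # Hypothetical constant: 100 labels per pair (quadratic) or per source (linear).
    for n in (2, 5, 10, 20):
        quadratic = pairwise_label_budget(n, 100)
        linear = per_source_label_budget(n, 100)
        print(f"sources={n:2d}  pairwise={quadratic:6d}  per-source={linear:5d}")
```

Under these assumed constants, 10 sources give 45 pairs and 4,500 labels in the pairwise budget, against 1,000 in the linear one. The gap widens as sources are added.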
Abstract
We consider a serious, previously-unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While there exists a rich literature describing almost all aspects of pairwise ER, this new challenge is arising now due to the unprecedented ability to acquire and store data from online sources, features driven by ER such as enriched search verticals, and the uniqueness of noisy and missing data characteristics for each source. We show on real-world and synthetic data that for state-of-the-art techniques, the reality of heterogeneous sources means that the number of labeled training data must scale quadratically in the number of sources, just to maintain constant precision/recall. We address this challenge with a brand new…
Taxonomy
Topics: Data Quality and Management · Data-Driven Disease Surveillance · Topic Modeling
