Entity Resolution with Empirically Motivated Priors
Rebecca C. Steorts

TL;DR
This paper introduces an empirical Bayesian approach to entity resolution that handles both categorical and string-valued data and compares favorably with standard methods on simulated and real data.
Contribution
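It proposes an empirical Bayesian method that takes the empirical distribution of the data as the prior for the latent entities, which avoids the prior specification problem, and that models deviations in string-valued fields, making it suited to noisy, real-world datasets.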
Findings
Performs favorably compared to standard methods on simulated and real data
Handles both categorical and string-valued variables effectively
Demonstrates robustness to hyper-parameter changes
Abstract
Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian--type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier HB approach not only by avoiding the prior specification problem but…
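The empirical Bayesian-type step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the surname values, the `linkage` array, and the helper names `empirical_prior` and `sample_latent_value` are all hypothetical, chosen only to show what "taking the empirical distribution function of the data as the prior" might look like for a single categorical field, and how a linkage structure assigns records to latent entities.

```python
from collections import Counter, defaultdict
import random


def empirical_prior(values):
    """Return the empirical distribution of a field as {value: relative frequency}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def sample_latent_value(prior, rng):
    """Draw one latent entity value from the empirical prior."""
    values = list(prior)
    weights = [prior[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# Toy, made-up surname field from five observed records (illustrative only).
surnames = ["smith", "smith", "smyth", "jones", "jones"]
prior = empirical_prior(surnames)

# Hypothetical linkage structure: record i is linked to latent entity linkage[i].
# These are the edges of the bipartite graph between records and entities.
linkage = [0, 0, 0, 1, 1]

# Group record indices by the latent entity they point to.
clusters = defaultdict(list)
for record_index, entity_index in enumerate(linkage):
    clusters[entity_index].append(record_index)

rng = random.Random(0)
draw = sample_latent_value(prior, rng)
```

In this sketch the prior is read directly off the observed data, so no separate prior over field values has to be specified by hand. In a full model the linkage would be inferred rather than fixed, with posterior probabilities expressing the uncertainty about which records refer to the same entity.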
