Entity Resolution with Empirically Motivated Priors

Rebecca C. Steorts

arXiv:1409.0643·stat.ME·April 29, 2015

TL;DR

This paper introduces a novel empirical Bayesian approach for entity resolution that effectively handles both categorical and string-valued data, improving accuracy and robustness over existing methods.

Contribution

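The paper proposes an empirical Bayesian method that sidesteps the prior specification problem by using the empirical distribution of the data as the prior for the latent entities. It also models deviations in string-valued fields, which improves entity resolution on noisy, real-world datasets.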

Findings

01 Performs favorably compared to standard methods on simulated and real data

02 Handles both categorical and string-valued variables effectively

03 Demonstrates robustness to hyper-parameter changes

Abstract

Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches are attractive because they naturally provide uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but…
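The empirical Bayesian step described in the abstract can be illustrated with a minimal sketch for a single categorical field. This is not the paper's implementation: the function names `empirical_prior` and `clusters`, the toy surname values, and the `linkage` list are all hypothetical, chosen only to show what "the empirical distribution of the data as the prior" and "edges of a bipartite graph" might look like in code.

```python
from collections import Counter


def empirical_prior(values):
    """Return the empirical distribution of observed field values.

    Each distinct value gets probability (count / number of records).
    In an empirical Bayesian setup, this distribution would serve as
    the prior over the latent entity's true value for that field.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def clusters(linkage):
    """Group record indices by the latent entity they are linked to.

    `linkage[i]` is the (hypothetical) latent entity index for record i,
    i.e. one edge of the bipartite graph between records and entities.
    """
    groups = {}
    for record_idx, entity_idx in enumerate(linkage):
        groups.setdefault(entity_idx, []).append(record_idx)
    return groups


# Hypothetical toy data: one noisy categorical field across four records.
records = ["smith", "smith", "smyth", "jones"]
prior = empirical_prior(records)

# Hypothetical linkage: the first three records point to entity 0,
# the last record points to entity 1.
linkage = [0, 0, 0, 1]
resolved = clusters(linkage)
```

In a full model the linkage would be unknown and inferred from the posterior; here it is fixed by hand purely to show the data structure. Records sharing a latent entity index are treated as duplicates of one another.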
