Variational Bayes for Merging Noisy Databases
Tamara Broderick, Rebecca C. Steorts

TL;DR
This paper introduces a scalable variational Bayesian approach for merging large, noisy databases to accurately identify unique individuals and quantify uncertainty, overcoming the computational limitations of previous MCMC methods.
Contribution
It derives a coordinate-ascent, mean-field variational inference algorithm for Bayesian entity resolution, aimed at making inference feasible on very large databases, and notes inference challenges that arise from the expected distribution of cluster sizes.
Findings
The proposed variational method is faster than MCMC approaches.
It effectively merges noisy databases with high accuracy.
The algorithm handles large-scale data with improved computational efficiency.
Abstract
Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster…
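Example (illustrative only)
To give a feel for what a coordinate-ascent update for mean-field variational Bayes looks like, the sketch below applies it to a toy model that is not the paper's entity-resolution model: a mixture of K unit-variance Gaussians with a zero-mean Gaussian prior on each component mean and uniform mixing weights. The function name `cavi_gmm`, the parameter `prior_var`, and the quantile-style initialisation are assumptions made for this illustration. Each sweep alternates between updating the soft cluster assignments and updating the Gaussian factor for each cluster mean.

```python
import math

def cavi_gmm(x, K, prior_var=10.0, iters=50):
    """Coordinate-ascent mean-field VB for a toy mixture of unit-variance Gaussians.

    Toy model (an assumption for illustration, not the paper's model):
      mu_k ~ N(0, prior_var),  c_i ~ Uniform{0..K-1},  x_i ~ N(mu_{c_i}, 1).
    Variational family: q(mu_k) = N(m[k], s2[k]),  q(c_i) = Categorical(phi[i]).
    """
    xs = sorted(x)
    n = len(xs)
    # Deterministic initialisation: spread the K means over the sorted data.
    m = [xs[int((k + 0.5) * n / K)] for k in range(K)]
    s2 = [1.0] * K
    phi = []
    for _ in range(iters):
        # Update q(c_i): phi[i][k] is proportional to
        # exp(E[mu_k] * x_i - E[mu_k^2] / 2), with E[mu_k^2] = s2[k] + m[k]^2.
        phi = []
        for xi in x:
            logits = [m[k] * xi - 0.5 * (s2[k] + m[k] ** 2) for k in range(K)]
            top = max(logits)  # subtract the max for numerical stability
            w = [math.exp(l - top) for l in logits]
            total = sum(w)
            phi.append([wk / total for wk in w])
        # Update q(mu_k): Gaussian with precision 1/prior_var + sum_i phi[i][k].
        for k in range(K):
            nk = sum(p[k] for p in phi)
            s2[k] = 1.0 / (1.0 / prior_var + nk)
            m[k] = s2[k] * sum(p[k] * xi for p, xi in zip(phi, x))
    return m, s2, phi

# Two well-separated groups of points, three near -5 and three near +5.
data = [-5.2, -5.0, -4.8, 4.8, 5.0, 5.2]
means, variances, assignments = cavi_gmm(data, K=2)
```

On this toy data the fitted means land close to the two group centres, pulled slightly toward zero by the prior, and each row of `assignments` puts nearly all of its mass on one component. The rows of `assignments` are the kind of per-record uncertainty a variational posterior provides in place of MCMC samples.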
Taxonomy
Topics: Data Quality and Management · Bayesian Methods and Mixture Models · Data Management and Algorithms
