SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
Rebecca C. Steorts, Rob Hall, and Stephen E. Fienberg

TL;DR
SMERED introduces a Bayesian, bipartite graph-based method for unsupervised record linkage and de-duplication, enabling uncertainty quantification and efficient computation across multiple files.
Contribution
The paper presents a novel bipartite graph representation and a linear-time MCMC algorithm for record linkage and de-duplication, improving efficiency and uncertainty estimation.
Findings
Effective on real and simulated data
Achieves linear-time complexity
Provides posterior probabilities of matches
Abstract
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter…
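As an illustration of the bipartite representation described in the abstract, the following minimal sketch stores each posterior draw of the linkage structure as a list that maps every record to a latent individual, then estimates the probability that a group of records all refer to the same person. It is not the authors' implementation: the function name `match_probability`, the list-of-lists sample format, and the toy draws are assumptions made for this example, and the MCMC sampler that would produce such draws is not shown.

```python
from typing import Sequence


def match_probability(samples: Sequence[Sequence[int]],
                      records: Sequence[int]) -> float:
    """Estimate the posterior probability that all `records` match.

    Each element of `samples` is one (hypothetical) posterior draw of the
    linkage structure: samples[s][r] is the id of the latent individual
    that record r is linked to in draw s. Records match in a draw when
    they are linked to the same latent individual.
    """
    if not samples:
        raise ValueError("need at least one posterior sample")
    if not records:
        raise ValueError("need at least one record index")
    hits = 0
    for labels in samples:
        # Records are only indirectly linked to each other: they match
        # exactly when they share a single latent individual.
        if len({labels[r] for r in records}) == 1:
            hits += 1
    return hits / len(samples)


# Toy data (made up for illustration): 4 draws over 4 records.
samples = [
    [0, 0, 1, 2],
    [0, 0, 0, 2],
    [0, 1, 1, 2],
    [0, 0, 1, 2],
]

p_pair = match_probability(samples, [0, 1])       # 2-way match
p_triple = match_probability(samples, [0, 1, 2])  # 3-way match
print(p_pair, p_triple)
```

Because every record points to one latent individual, a match among any number of records reduces to checking for a shared label in each draw, so the same function covers pairs, triples, and larger groups.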
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data
