A Bayesian Approach to Graphical Record Linkage and De-duplication
Rebecca C. Steorts, Rob Hall, and Stephen E. Fienberg

TL;DR
This paper introduces an unsupervised, graph-based Bayesian method for linking and de-duplicating records across multiple files. The method estimates the attributes of the underlying individuals and propagates linkage uncertainty into later analyses.
Contribution
It presents a novel bipartite graph representation for record linkage, enabling efficient, scalable Bayesian inference and integration with downstream analyses.
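To make the bipartite representation concrete, here is a minimal sketch, not the authors' implementation. It assumes each record is identified by a hypothetical `(file_index, record_index)` pair and points to exactly one latent individual; the identifiers and toy data are invented for illustration. Because records connect only through latent individuals, two records are linked exactly when they share an individual, so transitivity holds by construction.

```python
from collections import defaultdict

# Hypothetical toy linkage structure: (file index, record index) -> latent individual.
# The identifiers are illustrative only and do not come from the paper.
linkage = {
    (0, 0): "ind_a",
    (0, 1): "ind_b",
    (0, 2): "ind_a",  # duplicate of (0, 0) within file 0
    (1, 0): "ind_a",  # the same individual appearing in file 1
    (1, 1): "ind_c",
}

def clusters_from_linkage(linkage):
    """Group records by the latent individual they point to."""
    groups = defaultdict(list)
    for record, individual in linkage.items():
        groups[individual].append(record)
    return {individual: sorted(records) for individual, records in groups.items()}

def are_linked(linkage, record_1, record_2):
    """Two records are linked only indirectly, via a shared latent individual."""
    return linkage[record_1] == linkage[record_2]

clusters = clusters_from_linkage(linkage)
```

In this sketch, within-file duplicates and cross-file links are handled identically: both are simply records that share a latent individual.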
Findings
Outperforms existing methods in accuracy and scalability.
Provides a visual and probabilistic assessment of record links.
Demonstrates effectiveness on real longitudinal and survey data.
Abstract
We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time,…
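As an illustration of how linkage probabilities between records might be computed from posterior output, the following sketch assumes a sampler has already produced several draws of the linkage structure, each mapping a record index to a latent-individual label. The draws below are invented toy data, and `pairwise_link_probabilities` is a hypothetical helper, not part of the paper. The estimate for a pair is the fraction of draws in which both records point to the same latent individual.

```python
from collections import Counter
from itertools import combinations

# Hypothetical posterior draws (toy data): draw[i] is the latent-individual
# label assigned to record i in that draw.
samples = [
    [0, 1, 0, 2],
    [0, 1, 0, 0],
    [0, 1, 2, 2],
    [0, 1, 0, 2],
]

def pairwise_link_probabilities(samples):
    """Estimate P(records i and j refer to the same individual) for every pair."""
    n_records = len(samples[0])
    pairs = list(combinations(range(n_records), 2))
    counts = Counter()
    for draw in samples:
        for i, j in pairs:
            if draw[i] == draw[j]:  # both records share a latent individual
                counts[(i, j)] += 1
    return {pair: counts[pair] / len(samples) for pair in pairs}

probs = pairwise_link_probabilities(samples)
```

The estimate depends only on whether two labels agree within a draw, so it is unaffected if the latent-individual labels are permuted from one draw to the next.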
Taxonomy
Topics: Data Quality and Management
