A Bayesian Approach to Linking Data Without Unique Identifiers
Edwin Farley, Roee Gutman

TL;DR
This paper introduces gfs_sampler, a Python package for Bayesian record linkage. It accounts for uncertainty in the linkage and for interactions between matched record pairs, and it reduces the computational burden of these methods, with the goal of more accurate data integration.
Contribution
The paper presents a new Python package that implements a Bayesian approach to record linkage, with the aim of reducing bias in estimated relationships and making Bayesian linkage easier to apply than with existing methods.
Findings
Reduces bias in variable relationship estimates
Provides interval estimates accounting for linkage uncertainty
Simplifies the implementation of Bayesian linkage methods
Abstract
Existing file linkage methods may produce sub-optimal results because they consider neither the interactions between different pairs of matched records nor relationships between variables that are exclusive to one of the files. In addition, many of the current methods fail to address the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, implementation of these methods can often be complex and computationally intensive. This article presents the gfs_sampler package for the Python programming language that utilizes a Bayesian approach for file linkage. The linking procedure implemented in…
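To make the idea of "interval estimates that account for the uncertainty in the linkage" concrete, the toy sketch below samples one-to-one linkages between two simulated files with a simple Metropolis swap sampler and recomputes a regression slope under each sampled linkage. It is an illustration under stated assumptions, not the gfs_sampler API or the paper's procedure: the function names (log_lik, sample_linkages, slope), the Gaussian measurement-error model, and the simulated data are all hypothetical. Unlike the approach the abstract describes, this sketch links on a shared variable only and does not use relationships between variables exclusive to one file.

```python
import math
import random
import statistics


def log_lik(perm, a, b, sigma):
    """Log-likelihood (up to a constant) of a linkage.

    Assumed toy model: record i in file A links to record perm[i] in file B,
    and b[perm[i]] ~ Normal(a[i], sigma^2).
    """
    return sum(-0.5 * ((b[perm[i]] - a[i]) / sigma) ** 2 for i in range(len(a)))


def sample_linkages(a, b, sigma, n_iter, rng):
    """Metropolis sampler over one-to-one linkages using pairwise swaps.

    A uniform prior over permutations is assumed, so the acceptance
    ratio depends only on the likelihood.
    """
    n = len(a)
    perm = list(range(n))
    rng.shuffle(perm)
    current = log_lik(perm, a, b, sigma)
    draws = []
    for _ in range(n_iter):
        i, j = rng.sample(range(n), 2)
        perm[i], perm[j] = perm[j], perm[i]          # propose a swap
        proposed = log_lik(perm, a, b, sigma)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed                       # accept
        else:
            perm[i], perm[j] = perm[j], perm[i]      # reject: undo the swap
        draws.append(perm.copy())
    return draws


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Simulated data (hypothetical): a and b are noisy copies of a shared
# variable, x exists only in file A, and y exists only in file B.
rng = random.Random(0)
n = 30
a = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
true_perm = list(range(n))
rng.shuffle(true_perm)
b = [0.0] * n
y = [0.0] * n
for i in range(n):
    j = true_perm[i]
    b[j] = a[i] + rng.gauss(0, 0.3)
    y[j] = 2.0 * x[i] + rng.gauss(0, 0.5)

draws = sample_linkages(a, b, sigma=0.3, n_iter=20000, rng=rng)
kept = draws[5000::50]  # drop burn-in, then thin

# One slope per sampled linkage; the spread reflects linkage uncertainty only.
slopes = sorted(slope(x, [y[p[i]] for i in range(n)]) for p in kept)
lower = slopes[int(0.025 * len(slopes))]
upper = slopes[int(0.975 * len(slopes))]
print(f"slope interval across sampled linkages: [{lower:.2f}, {upper:.2f}]")
```

The interval here captures only the variation caused by not knowing which records match. A full analysis would also combine this with the sampling uncertainty of the slope within each linkage.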
Taxonomy
Topics: Data Quality and Management · Data-Driven Disease Surveillance · Bayesian Methods and Mixture Models
