Regression Modeling and File Matching Using Possibly Erroneous Matching Variables
Nicole M. Dalzell, Jerome P. Reiter

TL;DR
This paper introduces a Bayesian approach for linking records across databases using categorical variables that may contain errors, enabling simultaneous record matching and regression analysis.
Contribution
It develops a hierarchical Bayesian model that accounts for errors in matching variables, improving record linkage and regression estimation accuracy.
Findings
Effective in handling erroneous matching variables
Improves accuracy of record linkage and regression estimates
Demonstrated on artificial and real education data
Abstract
Many analyses require linking records from two databases comprising overlapping sets of individuals. In the absence of unique identifiers, the linkage procedure often involves matching on a set of categorical variables, such as demographics, common to both files. Typically, however, the resulting matches are inexact: some cross-classifications of the matching variables do not generate unique links across files. Further, the variables used for matching can be subject to reporting errors, which introduce additional uncertainty in analyses. We present a Bayesian file matching methodology designed to estimate regression models and match records simultaneously when categorical variables used for matching are subject to errors. The method relies on a hierarchical model that includes (1) the regression of interest involving variables from the two files given a vector indicating the links, (2)…
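The inexact-match problem described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Bayesian model. It only groups records from two toy files by their cross-classification on the matching variables and reports which cells yield a unique link, an ambiguous link, or no link at all. The records, the field names (sex, birth_year, county), and the helper candidate_cells are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def candidate_cells(file_a, file_b, keys):
    """Group record indices from both files by their values on the matching variables.

    Returns a dict mapping each cross-classification (a tuple of key values)
    to a pair (indices in file_a, indices in file_b).
    """
    cells = defaultdict(lambda: ([], []))
    for i, rec in enumerate(file_a):
        cells[tuple(rec[k] for k in keys)][0].append(i)
    for j, rec in enumerate(file_b):
        cells[tuple(rec[k] for k in keys)][1].append(j)
    return dict(cells)


# Hypothetical toy data: two files covering the same four individuals.
file_a = [
    {"sex": "F", "birth_year": 1990, "county": "Durham"},
    {"sex": "F", "birth_year": 1990, "county": "Durham"},
    {"sex": "M", "birth_year": 1988, "county": "Wake"},
    {"sex": "M", "birth_year": 1991, "county": "Orange"},
]
file_b = [
    {"sex": "F", "birth_year": 1990, "county": "Durham"},
    {"sex": "F", "birth_year": 1990, "county": "Durham"},
    {"sex": "M", "birth_year": 1988, "county": "Wake"},
    # Same person as the last record of file_a, with a misreported birth year.
    {"sex": "M", "birth_year": 1992, "county": "Orange"},
]

keys = ("sex", "birth_year", "county")
cells = candidate_cells(file_a, file_b, keys)

# Exactly one record on each side: the link is determined by the cell.
unique = {c: v for c, v in cells.items() if len(v[0]) == 1 and len(v[1]) == 1}
# Records on both sides but more than one on at least one side: link is ambiguous.
ambiguous = {
    c: v for c, v in cells.items()
    if v[0] and v[1] and (len(v[0]) > 1 or len(v[1]) > 1)
}
# Records on only one side: exact matching finds no partner (possibly due to an error).
unmatched = {c: v for c, v in cells.items() if not v[0] or not v[1]}

print("unique cells:   ", sorted(unique))
print("ambiguous cells:", sorted(ambiguous))
print("unmatched cells:", sorted(unmatched))
```

In this toy setup, the two Durham records cannot be told apart by the matching variables alone, and the misreported birth year leaves one true pair with no exact match. These are the two sources of uncertainty the abstract names: non-unique cross-classifications and reporting errors in the matching variables.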
