Cluster-based name embeddings reduce ethnic disparities in record linkage quality under realistic name corruption: evidence from the North Carolina Voter Registry
Joseph Lam, Mario Cortina-Borja, Rob Aldridge, Ruth Blackburn, Katie Harron

TL;DR
This study demonstrates that cluster-based name embeddings significantly reduce ethnic disparities in record linkage errors caused by name corruption, especially under realistic corruption scenarios, improving fairness in epidemiologic data linkage.
Contribution
It introduces a cluster-based forename-embedding approach that narrows ethnic disparities in record linkage errors, outperforming traditional string similarity methods under realistic name corruption conditions.
Findings
Cluster-based embeddings reduce under-linkage disparities for Hispanic and Black voters.
TF-adjusted Jaro-Winkler lowers overall error rates but leaves ethnic disparities in place.
Embedding models increase the overall false match rate but improve fairness across groups.
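For readers unfamiliar with the string comparator named in these findings, the following is a minimal sketch of the standard, unadjusted Jaro-Winkler similarity. It assumes the conventional prefix scale of 0.1 and a maximum prefix length of 4; these defaults, and the function names `jaro` and `jaro_winkler`, are illustrative and are not taken from the paper. The term-frequency adjustment and the cluster-based embedding comparator are not shown.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# Illustrative forename pair with one transposed letter pair.
score = jaro_winkler("MARTHA", "MARHTA")
```

Because the score depends only on character overlap, two spellings of the same name that share few characters score low however common the variation is, which is one plausible reason a purely string-based comparator could behave differently across naming traditions.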
Abstract
Record linkage errors that differ by ethnicity can bias epidemiologic estimates. Prior evidence often conflates heterogeneity in error mechanisms with unequal exposure to error. Using snapshots of the North Carolina Voter Registry (Oct 2011-Oct 2022), we derived empirical name-discrepancy profiles to parameterise realistic corruptions. From an Oct 2022 extract (n=848,566), we generated five replicate corrupted datasets under three settings that separately varied mechanism heterogeneity and exposure inequality, and linked records back to originals using unadjusted Jaro-Winkler, Term Frequency (TF)-adjusted Jaro-Winkler, and a cluster-based forename-embedding comparator combined with TF-adjusted surname comparison. We evaluated false match rate (FMR), missed match rate (MMR) and white-centric disparities. At a fixed MMR near 0.20, overall error rates and ethnic disparities diverged…
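The evaluation metrics named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: MMR is taken as missed true matches divided by all true matches, FMR as false links divided by all declared links (other denominators are in use, and the abstract does not say which the paper adopts), and a disparity as a group's rate minus the rate for the White reference group. The function name `error_rates_by_group`, the record layout and the toy counts are all hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(records, reference="White"):
    """Compute per-group MMR and FMR plus differences from a reference group.

    records: iterable of (group, is_true_match, was_linked) tuples, one per
    candidate record pair. This layout is an assumption for illustration.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0})
    for group, is_true_match, was_linked in records:
        c = counts[group]
        if is_true_match and was_linked:
            c["tp"] += 1          # correctly linked
        elif is_true_match:
            c["fn"] += 1          # missed match
        elif was_linked:
            c["fp"] += 1          # false match

    rates = {}
    for group, c in counts.items():
        true_matches = c["tp"] + c["fn"]
        links = c["tp"] + c["fp"]
        rates[group] = {
            # Assumed definition: share of true matches that were missed.
            "mmr": c["fn"] / true_matches if true_matches else float("nan"),
            # Assumed definition: share of declared links that are wrong.
            "fmr": c["fp"] / links if links else float("nan"),
        }

    ref = rates[reference]
    disparities = {
        group: {
            "mmr_diff": r["mmr"] - ref["mmr"],
            "fmr_diff": r["fmr"] - ref["fmr"],
        }
        for group, r in rates.items()
        if group != reference
    }
    return rates, disparities


# Toy counts, invented purely to exercise the function.
toy = (
    [("White", True, True)] * 8 + [("White", True, False)] * 2
    + [("Hispanic", True, True)] * 6 + [("Hispanic", True, False)] * 4
    + [("Hispanic", False, True)] * 2
)
rates, disparities = error_rates_by_group(toy)
```

Reporting the difference from a reference group alongside the overall rates makes it visible when a method lowers the aggregate error while leaving the gap between groups unchanged.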
Taxonomy
Topics: Data Quality and Management · Census and Population Estimation · Authorship Attribution and Profiling
