# Evaluation of the Accuracy of Probabilistic Record Linkage Across Sociodemographic Categories in 4 Databases: Exploratory Study

**Authors:** Cristina Barboi, Fangqian Ouyang, Lauren Lembcke, Andrew Martin, Ashley Griffith, Katie S Allen, Xiaochun Li, Huiping Xu, Shaun J Grannis

PMC · DOI: 10.2196/78622 · JMIR Formative Research · 2026-02-26

## TL;DR

Probabilistic patient record matching accuracy varied by age, sex, race, and ethnicity, with lower accuracy in strata that had more missing or discordant data, raising equity concerns for linked health data.

## Contribution

The study evaluates probabilistic record linkage accuracy stratified by age, sex, race, and ethnicity across 4 Indiana data sources, and relates matching performance to data completeness, distinct value ratio, and Shannon entropy.

## Key findings

- Sensitivity and F1-scores were lower in strata with greater missingness or discordance, particularly in the Newborn Screening and Social Security Administration Death Master File datasets.
- Race and ethnicity had the highest missingness and lowest informational diversity, coinciding with the largest declines in accuracy.
- Both low and high informational diversity in data were linked to reduced matching performance.

## Abstract

Accurate patient record linkage is essential for clinical care, health information exchange, research, and public health surveillance. However, linkage accuracy may vary across demographic groups due to differences in data completeness, quality, and the structural factors underlying how demographic information is captured.

This study aimed to explore whether probabilistic patient matching accuracy varies by age, sex, race, and ethnicity and to identify potential sources of bias that may influence matching performance.

We used 4 Indiana data sources—the Indiana Network for Patient Care, Newborn Screening, Social Security Administration Death Master File, and Marion County Public Health Department—and applied a modified Fellegi-Sunter probabilistic linkage algorithm accommodating missing data under a missing at random assumption. Gold standard match status was established through dual manual review with adjudication. For each dataset, matching sensitivity, positive predictive value, and F1-scores were estimated and stratified by age, sex, race, and ethnicity. Data completeness, distinct value ratio, and Shannon entropy were assessed to characterize data quality. Ninety-five percent bootstrap CIs were used to assess significance.
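As a rough illustration of the linkage step described above, the sketch below scores a candidate record pair with Fellegi-Sunter log-likelihood-ratio weights and skips any field that is missing in either record. This is a simplified stand-in, not the authors' modified algorithm: the field names, the m/u probabilities in `M_U`, and the rule that a missing field contributes zero weight are all assumptions made for illustration.

```python
import math

# Hypothetical (m, u) probabilities per field; illustrative values, not from the study.
# m = P(fields agree | true match), u = P(fields agree | non-match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "sex": (0.99, 0.50),
}

def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum log2 likelihood-ratio weights over fields observed in both records."""
    total = 0.0
    for field, (m, u) in m_u.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # assumption: a missing field adds no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total
```

A pair would then be declared a match when its total weight exceeds a chosen threshold. Under this simplification, records with more missing fields accumulate less evidence, which is one plausible route by which missingness lowers sensitivity.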

The algorithm's matching F1-score was greater than 0.82 for all age strata and ranged from 0.88 to 0.97 across sex, 0.85 to 0.99 across race, and 0.88 to 0.99 across ethnicity. Sensitivity ranged from 0.70 to 0.97 across age strata, 0.76 to 0.97 across sex, 0.85 to 0.99 across race, and 0.85 to 0.989 across ethnicity. Lower sensitivity and F1-scores were consistently observed in strata with greater missingness or discordance, particularly in the Newborn Screening and Social Security Administration Death Master File datasets. Race and ethnicity exhibited the highest missingness and lowest informational diversity, coinciding with the largest declines in accuracy. Shannon entropy and distinct value ratio varied across demographic groups and were strongly associated with performance, indicating that both low and excessively high informational diversity can impair matching.
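The three data quality measures named above can be computed per field with a few lines of Python. The sketch below is a minimal version under stated assumptions: missing values are represented as `None`, and distinct value ratio is taken to be the number of distinct non-missing values divided by the number of non-missing values. The paper's exact definitions may differ.

```python
import math
from collections import Counter

def completeness(values):
    """Fraction of entries that are not missing (None)."""
    return sum(v is not None for v in values) / len(values)

def distinct_value_ratio(values):
    """Distinct non-missing values divided by non-missing count (assumed definition)."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0

def shannon_entropy(values):
    """Shannon entropy in bits of the non-missing value distribution."""
    present = [v for v in values if v is not None]
    n = len(present)
    counts = Counter(present)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A field where nearly every record shares one value has entropy near 0 and offers little discriminating power, while a field where almost every value is unique has a distinct value ratio near 1.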

Probabilistic patient matching accuracy is not uniform across demographics and is strongly influenced by data quality and completeness. Although overall matching performance, as assessed by the F1-score, remained above 0.8, it varied across datasets when stratified by sociodemographic characteristics. Sociodemographic data missingness is associated with lower matching accuracy, raising equity and ethical concerns for clinical, research, and public health applications. Routine demographic-stratified evaluations of matching accuracy, improved standardization of sociodemographic data, and fairness-aware linkage methods are essential to prevent the amplification of structural inequities in linked health datasets.
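A routine demographic-stratified evaluation of the kind recommended above could look like the following sketch. It computes sensitivity, positive predictive value, and F1 from (predicted, gold standard) match labels, groups F1 by stratum, and derives a percentile bootstrap interval. The record layout (`stratum`, `pred`, `truth` keys), the function names, and the percentile method are assumptions for illustration; the paper's bootstrap procedure is not specified in this summary.

```python
import random

def prf(pairs):
    """pairs: list of (predicted_match, true_match) booleans -> (sensitivity, PPV, F1)."""
    tp = sum(p and t for p, t in pairs)
    fp = sum(p and not t for p, t in pairs)
    fn = sum(t and not p for p, t in pairs)
    nan = float("nan")
    sens = tp / (tp + fn) if tp + fn else nan
    ppv = tp / (tp + fp) if tp + fp else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else nan
    return sens, ppv, f1

def f1_by_stratum(records):
    """records: dicts with hypothetical keys 'stratum', 'pred', 'truth'."""
    groups = {}
    for r in records:
        groups.setdefault(r["stratum"], []).append((r["pred"], r["truth"]))
    return {s: prf(p)[2] for s, p in groups.items()}

def bootstrap_f1_ci(pairs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (resampling pairs with replacement)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        f1 = prf(sample)[2]
        if f1 == f1:  # drop NaN resamples (no positives at all)
            stats.append(f1)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

Comparing the per-stratum intervals, rather than a single pooled F1, is what exposes subgroup differences that an overall score above 0.8 would otherwise hide.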

## Full-text entities

- **Diseases:** infectious disease (MESH:D003141), allergies (MESH:D004342), Death (MESH:D003643)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12945093/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12945093/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/PMC12945093/full.md

---
Source: https://tomesphere.com/paper/PMC12945093