# Evaluating the impact of discordant and missing demographic information on population health assessments using linked electronic health records and Census Bureau microdata

**Authors:** Derek Ouyang, Aubrey Limburg, David H. Rehkopf, Jacob Goldin, Robert L. Phillips, Victoria Udalova, Daniel E. Ho

PMC · DOI: 10.1371/journal.pdig.0001289 · PLOS Digital Health · 2026-03-17

## TL;DR

This study examines how missing or conflicting race and ethnicity data in electronic health records affect population health assessments, and shows that these problems shift health outcome estimates most for smaller racial and ethnic groups.

## Contribution

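The study demonstrates how the quality of demographic data in EHRs can be comprehensively assessed by linking them at the individual level to restricted U.S. Census Bureau microdata, which the authors say was previously not possible because of limits on data access and linkage. It also quantifies how missing and discordant race and ethnicity information changes estimates of disease prevalence.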

## Key findings

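- Among 5.86 million linked patients, 19.3% were missing race and ethnicity information in EHRs, and 8.0% had race and ethnicity recorded discordantly between the EHR and the Census data.
- Discordance was lowest for White, Black, and Asian patients and highest for American Indian and Alaska Native, Native Hawaiian and Pacific Islander (NHPI), and Multiracial patients.
- Missingness and discordance affected estimates of group differences for all 50 health outcomes considered, particularly for smaller racial/ethnic groups; for example, NHPI Type 2 diabetes diagnosis rates changed by 24 percent.
- The authors argue that much can be done to diagnose and mitigate both problems, particularly at the point when demographic information is collected, which would make population health assessments more accurate.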
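
As a rough illustration of the two data quality measures above, the following Python sketch computes a missingness share and a discordance share over a handful of linked records. The records, the `quality_shares` helper, and the choice to divide both counts by all linked records are assumptions made for illustration; they are not the authors' data or code.

```python
from typing import Optional

# Hypothetical toy records: (EHR race/ethnicity, Census race/ethnicity).
# None marks a value that is missing from the EHR.
records: list[tuple[Optional[str], str]] = [
    ("White", "White"),
    (None, "Black"),
    ("Asian", "Asian"),
    ("White", "Multiracial"),
    ("Black", "Black"),
    (None, "NHPI"),
    ("White", "AIAN"),
    ("Asian", "Asian"),
    ("White", "White"),
    ("Black", "Black"),
]


def quality_shares(pairs: list[tuple[Optional[str], str]]) -> tuple[float, float]:
    """Return (missing_share, discordant_share) over all linked records.

    Assumption: both shares use the full set of linked records as the
    denominator, and a record counts as discordant only when the EHR
    value is present and differs from the Census value.
    """
    total = len(pairs)
    missing = sum(1 for ehr, _ in pairs if ehr is None)
    discordant = sum(
        1 for ehr, census in pairs if ehr is not None and ehr != census
    )
    return missing / total, discordant / total


missing_share, discordant_share = quality_shares(records)
print(f"missing: {missing_share:.1%}, discordant: {discordant_share:.1%}")
```

Keeping missing and discordant records in separate counts matters here, because a record with no EHR value cannot be compared with its Census counterpart at all.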

## Abstract

Administrative records are increasingly being used to study population-level outcomes, despite high rates of missingness and discrepancies (i.e., discordance) in demographic identifiers across different sources of data, which could reduce the quality of such assessments. Few studies have evaluated the relationship between these phenomena in administrative records and downstream impacts on assessments in consequential domains such as healthcare. We characterize patterns of discordance and missingness of race and ethnicity in electronic health records (EHR; 2010–2021) derived from the American Board of Family Medicine’s primary care registry, linked at the individual level to restricted U.S. Census Bureau microdata (2000, 2010, 2020 Census; American Community Survey 2005–2022). Among 5.86 million linked patients, 19.3% were missing race and ethnicity information in EHRs, and 8.0% had race and ethnicity information that was recorded discordantly between the two sources, with the lowest discordance for White, Black, and Asian patients and the highest for American Indian and Alaska Native, Native Hawaiian and Pacific Islander (NHPI), and Multiracial patients. Missingness and discordance impacted estimation of group differences for all 50 health outcomes we consider, particularly for smaller racial/ethnic groups, such as a 24 percent change in NHPI Type 2 diabetes diagnosis rates. Our research has three major implications for the work of government agencies, academics, clinicians, and other stakeholders interested in utilizing EHRs for research purposes. First, we demonstrate how the quality of demographic data in administrative records can be comprehensively assessed, which previously has not been possible due to limitations in data access and linkage. Second, we systematically evaluate the impact of discordant and missing demographic information on our ability to accurately estimate disease prevalence. Third, we underscore the importance of evaluating discordance of demographic information both within and across different administrative domains.

Population-level assessments in consequential domains such as healthcare depend on large, high-quality administrative data. However, discordance and missingness of demographic information across records can distort analyses conducted by researchers and policymakers. We provide robust and comprehensive evidence and characterization of these patterns through a dataset of 5.86 million patients in the United States with linked information from electronic health records and restricted U.S. Census Bureau microdata. In particular, we demonstrate how these data quality issues can affect estimation of consequential group-level health outcomes, such as Type 2 diabetes diagnosis rates. Discordance and missingness are widespread and highly concentrated in specific administrative settings like primary care clinics, creating the potential for error at every geographic scale of assessment. However, much can be done to diagnose and mitigate discordance and missingness, particularly at the point when demographic information is collected. With more complete and concordant demographic information and improved data quality in electronic health records and other administrative records, government agencies, academics, and practitioners can more accurately measure and address health challenges.
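
To make the effect on group-level estimates concrete, the sketch below computes a diagnosis rate per group twice, once grouping patients by their EHR label and once by their Census label, and then reports the relative change. The patient rows, the `group_rates` and `relative_change` helpers, and the choice of the EHR-based rate as the baseline are all hypothetical; the summary does not state how the authors define the change.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical toy rows: (EHR label or None, Census label, has_diagnosis).
Patient = tuple[Optional[str], str, bool]
patients: list[Patient] = [
    ("NHPI", "NHPI", True),
    ("NHPI", "NHPI", False),
    ("Asian", "NHPI", True),   # discordant record
    (None, "NHPI", True),      # missing in the EHR
    ("NHPI", "Asian", False),  # discordant record
    ("Asian", "Asian", False),
    ("Asian", "Asian", True),
    (None, "Asian", False),    # missing in the EHR
]


def group_rates(rows: list[Patient], label_index: int) -> dict[str, float]:
    """Diagnosis rate per group, grouping by the label at label_index.

    Rows whose chosen label is None are skipped, which mimics dropping
    patients with missing race and ethnicity from a group-level estimate.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        label = row[label_index]
        if label is None:
            continue
        counts[label][0] += int(row[2])  # diagnosed
        counts[label][1] += 1            # total in group
    return {group: d / n for group, (d, n) in counts.items()}


def relative_change(new: float, old: float) -> float:
    """Relative change from old to new (assumed baseline: old)."""
    return (new - old) / old


ehr_rates = group_rates(patients, 0)     # grouped by EHR label
census_rates = group_rates(patients, 1)  # grouped by Census label
nhpi_change = relative_change(census_rates["NHPI"], ehr_rates["NHPI"])
print(f"NHPI rate, EHR labels:    {ehr_rates['NHPI']:.3f}")
print(f"NHPI rate, Census labels: {census_rates['NHPI']:.3f}")
print(f"relative change:          {nhpi_change:+.0%}")
```

With so few patients in a group, reclassifying or dropping one or two records moves the rate substantially, which is the mechanism behind the larger shifts reported for smaller groups.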

## Linked entities

- **Diseases:** Type 2 diabetes (MONDO:0005148)

## Full-text entities

- **Diseases:** diabetes (MESH:D003920), Type 2 diabetes (MESH:D003924), hypertension (MESH:D006973), lipid disorder (MESH:D011017), hyperglycemia (MESH:D006943), cough (MESH:D003371), ACS (MESH:D003147), polyneuropathy (MESH:D011115), rhinitis (MESH:D012220)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12994837/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12994837/full.md

## References

81 references — full list in the complete paper: https://tomesphere.com/paper/PMC12994837/full.md

---
Source: https://tomesphere.com/paper/PMC12994837