# Evaluating the sampling effect of propensity score matching for reducing selection bias in medical data

**Authors:** Minji Roh, Sujin Yum, Gihun Joo, Jae-Won Jang, Hyeonseung Im

PMC · DOI: 10.3389/fpubh.2026.1747762 · Frontiers in Public Health · 2026-02-10

## TL;DR

This paper evaluates whether propensity score matching (PSM) reduces selection bias in imbalanced medical data and how it affects the classification performance of machine learning models.

## Contribution

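The study systematically evaluates PSM alongside eleven resampling techniques (five undersampling, three oversampling, and three hybrid) across three medical datasets with differing degrees of demographic selection bias, using six classification models.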

## Key findings

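- PSM reduces selection bias (measured by SMD) and maintains stable classification performance when demographic imbalance is limited or moderate.
- Under those conditions PSM enhances the internal validity of the model, which suggests potential for better generalization to external datasets in real-world medical applications.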
- Extreme selection bias or overly restrictive PSM can degrade model performance.

## Abstract

In real-world medical data, selection bias can significantly impact the performance of machine learning models, potentially leading to distorted outcomes. However, research aimed at mitigating selection bias remains relatively limited.

In this study, we evaluate the effectiveness of Propensity Score Matching (PSM) in reducing selection bias and assess its impact on classification performance in imbalanced medical data. Specifically, we apply PSM alongside five undersampling, three oversampling, and three hybrid sampling techniques to three medical datasets: rapidly progressive dementia prediction (ADNI, n = 628, events = 51), hypothyroidism prediction (UCI, n = 3,772, events = 3,481), and cardiovascular disease prediction (Kaggle, n = 253,680, events = 23,893), each exhibiting varying degrees of demographic selection bias. We train and compare six classification models to assess the impact of each resampling technique on model performance. The magnitude of selection bias is quantified using the standardized mean difference (SMD), while model performance is assessed using the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision-Recall Curve (AUPRC), accuracy, precision, recall, F1-score, specificity, calibration curves, Brier score, and decision curve analysis.
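To make the SMD concrete, here is a minimal sketch of the standardized mean difference for one continuous covariate, using the common pooled-standard-deviation form. The function name, the age values, and the choice of which two groups to compare are illustrative assumptions; this summary does not state the exact SMD variant or grouping the authors used.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """SMD for a continuous covariate: mean difference over the pooled SD.

    Assumed form (not confirmed by the paper summary):
        (mean_a - mean_b) / sqrt((var_a + var_b) / 2)
    with sample variances.
    """
    diff = mean(group_a) - mean(group_b)
    pooled_sd = math.sqrt((variance(group_a) + variance(group_b)) / 2)
    if pooled_sd == 0:
        # Both groups are constant: balanced if the means agree, else unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


# Hypothetical ages for two groups (made-up values, not from the paper).
age_group_a = [72, 75, 78, 80, 74]
age_group_b = [65, 70, 68, 72, 66]

smd = standardized_mean_difference(age_group_a, age_group_b)
print(f"SMD for age: {smd:.3f}")
```

An absolute SMD below about 0.1 is a widely used rule of thumb for acceptable covariate balance; whether the paper applies that threshold is not stated here.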

The results indicate that PSM reduces SMD within the dataset, maintains stable classification performance, and enhances the internal validity of the model under conditions of limited or moderate demographic imbalance.

These advantages suggest that PSM has the potential to improve model reliability and facilitate better generalization to external datasets in real-world medical applications. However, in datasets with extreme selection bias or when overly restrictive matching is applied, PSM can degrade model performance, underscoring the importance of choosing strategies that account for dataset characteristics.
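The effect of overly restrictive matching can be illustrated with a small sketch of greedy 1:1 nearest-neighbour matching without replacement, using a caliper on precomputed propensity scores. Everything here is an assumption for illustration: the function name, the scores, and the caliper values are hypothetical, and this summary does not specify the matching algorithm, ratio, or caliper the authors used. In practice the scores would typically come from a model such as logistic regression fitted on the covariates.

```python
def greedy_caliper_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: dicts mapping a sample id to its propensity score.
    caliper: maximum allowed absolute score difference for a valid pair.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # copy so the caller's dict is not modified
    pairs = []
    # Process treated samples in ascending score order for determinism.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control sample by propensity score.
        c_id = min(available, key=lambda k: abs(available[k] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Hypothetical propensity scores (made-up values, not from the paper).
treated_scores = {"t1": 0.30, "t2": 0.55, "t3": 0.90}
control_scores = {"c1": 0.28, "c2": 0.50, "c3": 0.62, "c4": 0.10}

loose = greedy_caliper_match(treated_scores, control_scores, caliper=0.30)
strict = greedy_caliper_match(treated_scores, control_scores, caliper=0.03)

print(len(loose), "pairs with caliper 0.30")
print(len(strict), "pairs with caliper 0.03")
```

With these made-up scores the tighter caliper keeps fewer matched pairs. A smaller matched sample leaves less training data, which is one plausible way restrictive matching could hurt a classifier on a dataset with few events.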

## Linked entities

- **Diseases:** dementia (MONDO:0001627), hypothyroidism (MONDO:0005420), cardiovascular disease (MONDO:0004995)

## Full-text entities

- **Diseases:** Dementia (MESH:D003704), Heart Disease (MESH:D006331), RPD (MESH:C538458), Thyroid Disease (MESH:D013959), cognitive impairment (MESH:D003072), HypoT (MESH:D007037), MI (MESH:D009203), CVD (MESH:D002318), MR (MESH:D008944), CHD (MESH:D003327), neurodegenerative condition (MESH:D019636), AD (MESH:D000544), MCI (MESH:D060825)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12929392/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12929392/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12929392/full.md

---
Source: https://tomesphere.com/paper/PMC12929392