# Improving the value of population health data for health policy and decision-making using machine learning algorithms in EQ-5D-5L index estimation

**Authors:** Áron Hölgyesi, Zsombor Zrubka, Mehdi Neshat, Viktor Jáger, Áron Kincses, Levente Kovács, László Gulácsi, Seyedali Mirjalili, Márta Péntek

PMC · DOI: 10.1038/s41598-025-32123-6 · Scientific Reports · 2026-01-30

## TL;DR

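This study uses machine learning to estimate EQ-5D-5L index scores from routinely collected population survey data (N = 9,324). Adding Minimum European Health Module (MEHM) items to sociodemographic variables improves accuracy: AdaBoost reached G = 0.955 with both data types and no imputation, versus G = 0.871 with sociodemographics alone.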

## Contribution

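The study applies machine learning to estimate EQ-5D-5L index scores from sociodemographic and Minimum European Health Module (MEHM) data that population surveys already collect, comparing fourteen models across five research scenarios using the G score.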

## Key findings

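- AdaBoost ranked best across scenarios (mean rank 2.87), ahead of Multilayer Perceptron (2.94) and XGBoost (3.60). It performed best when both sociodemographic and Minimum European Health Module (MEHM) data were used without imputation (G = 0.955).
- Including MEHM data improved prediction accuracy: AdaBoost's G score fell to 0.871 when the estimation relied on sociodemographic data alone.
- Data imputation had potentially undesirable effects on model performance in machine learning-based estimation; the best result was obtained without imputation.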
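The finding on imputation can be made more concrete with a general illustration. The sketch below shows one well-known side effect of simple mean imputation: filling gaps with the column mean shrinks the variance of that feature. This is a generic property, not a description of the imputation method or the mechanism in the study, which this summary does not specify. The `values` column is hypothetical and is not data from the paper.

```python
from statistics import mean, pvariance

# Hypothetical feature column with missing entries (None); NOT data from the study.
values = [1.0, 2.0, None, 4.0, None, 5.0, 3.0, None]

# Complete-case view: keep only the observed entries.
observed = [v for v in values if v is not None]

# Mean imputation: replace every missing entry with the observed mean.
fill = mean(observed)
imputed = [fill if v is None else v for v in values]

# Population variance before and after imputation.
var_observed = pvariance(observed)
var_imputed = pvariance(imputed)

print(f"variance, observed only: {var_observed:.2f}")
print(f"variance, mean-imputed:  {var_imputed:.2f}")
```

Every imputed entry sits exactly at the mean, so it adds rows without adding spread. A model trained on such a column sees less variation than the real population has, which is one plausible way imputation could hurt predictions.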

## Abstract

This study aimed to estimate patient-level EQ-5D-5L index scores using routinely collected sociodemographic and Minimum European Health Module (MEHM) data from seven extensive population surveys (N = 9,324). Fourteen machine learning (ML) models were compared in five research scenarios using the recently developed G score. Based on the performance ranking shown across scenarios, AdaBoost emerged as the best model (mean rank 2.87), followed by Multilayer Perceptron (MLP) and XGBoost (mean ranks 2.94 and 3.60, respectively). AdaBoost achieved the best results when no imputation was done and both sociodemographic and MEHM data were included (G = 0.955), but its performance declined when the estimation was based solely on sociodemographics (G = 0.871). The results confirm that the EQ-5D-5L index can be well predicted from existing statistical data using ML methods and that the MEHM improves the estimation. Our findings also highlight the potentially undesirable effects of data imputation in ML-based estimations. The methods presented in this study enhance the usability of existing health data, giving analysts and decision-makers a practical way to populate health-economic evaluations when primary data collection is impractical or impossible. Nonetheless, even advanced ML algorithms have limitations, so direct EQ-5D-5L data collection should remain the preferred approach whenever feasible.
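The abstract ranks models by their mean rank across scenarios. As a minimal sketch of that kind of aggregation, the code below ranks models within each scenario by a higher-is-better score and then averages the ranks per model. The model names, scenario names, and scores are hypothetical placeholders, not values from the paper. The tie-handling rule (tied scores share their average rank) and the choice of what gets ranked are also assumptions, because this summary does not describe the authors' exact procedure.

```python
from statistics import mean

# Hypothetical scores (higher is better); NOT values from the paper.
scores = {
    "scenario_1": {"model_a": 0.90, "model_b": 0.85, "model_c": 0.80},
    "scenario_2": {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
    "scenario_3": {"model_a": 0.88, "model_b": 0.80, "model_c": 0.82},
}


def rank_within_scenario(model_scores):
    """Rank models 1..n by descending score; tied scores share their average rank."""
    ordered = sorted(model_scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the run of models tied with position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_rank
        i = j + 1
    return ranks


def mean_ranks(scores_by_scenario):
    """Average each model's rank over all scenarios (lower is better)."""
    per_scenario = [rank_within_scenario(s) for s in scores_by_scenario.values()]
    models = per_scenario[0].keys()
    return {m: mean(r[m] for r in per_scenario) for m in models}


result = mean_ranks(scores)
best_model = min(result, key=result.get)
print(result, best_model)
```

A mean rank rewards consistency: a model that is never the worst can beat one that wins a single scenario and performs poorly elsewhere.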

The online version contains supplementary material available at 10.1038/s41598-025-32123-6.

## Full-text entities

- **Diseases:** musculoskeletal disorders (MESH:D009140), activity limitation (MESH:D045745), ill health (MESH:D000071069), cancers (MESH:D009369), CHR (MESH:D015211), functional disability (MESH:D003291), standing illness (MESH:D002908), Parkinson's disease (MESH:D010300), long-standing illness (MESH:D000094024)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12865033/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12865033/full.md

## References

8 references — full list in the complete paper: https://tomesphere.com/paper/PMC12865033/full.md

---
Source: https://tomesphere.com/paper/PMC12865033