# Extending BEHRT to UK Biobank: assessing transformer model performance in clinical prediction

**Authors:** Yusuf Yildiz, Goran Nenadic, Meghna Jani, David A. Jenkins

PMC · DOI: 10.3389/fdgth.2026.1715506 · Frontiers in Digital Health · 2026-02-10

## TL;DR

This paper evaluates how well BEHRT, a transformer model, predicts clinical outcomes from hospital-based UK Biobank data. The larger model performed better on long-term prediction, and the CALIBER terminology gave higher average precision than ICD-10.

## Contribution

The study systematically evaluates how model size, medical terminology (CALIBER vs ICD-10), and data split strategy affect clinical prediction performance on UK Biobank data.

## Key findings

- The large BEHRT model outperformed the smaller one in long-term diagnosis prediction (AUROC = 0.874 vs 0.858 at 5 years), while differences were marginal for 6-month prediction.
- The CALIBER terminology yielded a higher Average Precision Score than ICD-10 (0.773 vs 0.678).
- Model performance varied considerably with design choices, especially in long-term prediction tasks.
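
The two metrics quoted above answer different questions: AUROC measures how often a positive case is ranked above a negative one, while the Average Precision Score summarises precision along the ranked list and is more sensitive to rare outcomes. The following is a minimal pure-Python sketch of both metrics on toy data. The labels and scores are invented for illustration, the function names are hypothetical, and this summary does not state how the paper averages the metrics across diagnosis labels, so this should not be read as the authors' evaluation code.

```python
def auroc(y_true, y_score):
    """Fraction of (positive, negative) pairs where the positive has the
    higher score; tied pairs count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each positive,
    with items sorted by descending score (assumes no tied scores)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# Toy example (hypothetical data): 1 = diagnosis occurred within the window.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(f"AUROC = {auroc(y_true, y_score):.3f}")
print(f"Average precision = {average_precision(y_true, y_score):.3f}")
```

Because average precision depends on how many positives there are and how early they appear in the ranking, it can shift noticeably when the label vocabulary changes, even when AUROC moves little.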

## Abstract

Transformer-based models have shown strong potential for clinical prediction using electronic health record data, yet their performance can vary depending on modelling decisions and data characteristics.

In this study, we trained a BEHRT model on hospital-based UK Biobank data and evaluated its performance across four clinical prediction tasks, including next-visit diagnosis and longer-term diagnosis prediction up to five years. We systematically assessed the impact of model size, medical terminology (CALIBER vs ICD-10), and data split strategies.

The large model consistently outperformed the smaller one in long-term prediction tasks (AUROC = 0.874 vs 0.858 at 5 years), while differences were marginal in 6-month prediction tasks. Performance was also sensitive to vocabulary size, with the CALIBER model yielding a higher Average Precision Score than the ICD-10 model (0.773 vs 0.678).

Our results show that transformer models can achieve high predictive performance across diverse clinical scenarios, but outcomes vary considerably depending on modelling choices, particularly in long-term prediction tasks.

## Full-text entities

- **Diseases:** diabetes (MESH:D003920), Arthrosis (MESH:D010003), Essential Hypertension (MESH:D000075222), cardiovascular disease (MESH:D002318)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12929515/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12929515/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12929515/full.md

---
Source: https://tomesphere.com/paper/PMC12929515