# Multimodal large language model versus emergency physicians for burn assessment: a prospective non-inferiority study

**Authors:** Ahmet Aykut, Ali Rıza Karayıl, Cem Yıldırım, Ertuğ Günsoy, Mehmet Tatlı, Murat Avcı

PMC · DOI: 10.1186/s13049-026-01577-6 · Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine · 2026-02-05

## TL;DR

A general-purpose multimodal AI model was non-inferior to emergency physicians in estimating burn size (TBSA) but agreed far less with an expert panel on burn depth, supporting only a narrow adjunct role for burn size estimation.

## Contribution

First prospective non-inferiority study comparing a general-purpose multimodal large language model with emergency physicians for burn assessment, conducted in a single tertiary emergency department against an expert-panel reference standard.

## Key findings

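- The model's mean absolute TBSA error was 1.40 percentage points versus 0.89 for the physician median; the paired analysis met the pre-specified 3-percentage-point non-inferiority margin (one-sided 95% upper bound 0.50).
- Depth agreement with the expert panel was slight for the model (quadratic-weighted kappa 0.14) and substantially higher for physician consensus (quadratic-weighted kappa 0.65); the model systematically underestimated deeper burns.
- 87.5% of the model's region-level TBSA estimates were within ±3 percentage points of the reference standard, and 98.4% were within ±5.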

## Abstract

Accurate burn size and depth assessment at first contact guides fluid resuscitation, referral, and operative planning, yet both tasks show meaningful inter-clinician variability. General-purpose multimodal large language models may offer scalable, image-based decision support in emergency care, but prospective benchmarking against clinicians and a robust reference standard remains limited.

We conducted a prospective, single-centre diagnostic accuracy and agreement study in a tertiary emergency department (22 July–8 September 2025). Consecutive acute burn presentations (< 24 h) were screened; protocol-conformant cases contributed standardized three-view photographs per anatomically distinct burn region. A multimodal large language model generated region-level estimates of total body surface area (TBSA) contribution and burn depth class. Eighteen emergency physicians independently rated the same images and minimal metadata, blinded to model and reference outputs. A three-member expert panel served as the reference standard by consensus. The primary endpoint was non-inferiority of the model versus the physician median for region-level absolute TBSA error relative to the panel, with a pre-specified margin of 3 percentage points, using patient-level cluster bootstrap for inference. Secondary endpoints included TBSA agreement and depth agreement (quadratic-weighted kappa).
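The primary analysis can be illustrated with a minimal sketch. It assumes a percentile cluster bootstrap in which whole patients are resampled (so that regions from the same patient stay together) and a one-sample Hodges–Lehmann estimate on the paired differences (model absolute error minus physician-median absolute error). The abstract does not describe the exact bootstrap variant, so the function names, the number of replicates, and the toy data below are illustrative assumptions, not the authors' code or data.

```python
import random
from statistics import median


def hodges_lehmann(diffs):
    """One-sample Hodges-Lehmann estimate: median of all Walsh averages."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def cluster_bootstrap_upper_bound(diffs_by_patient, n_boot=2000, level=0.95, seed=0):
    """One-sided upper percentile bound, resampling patients (clusters) with replacement."""
    rng = random.Random(seed)
    patients = list(diffs_by_patient)
    stats = []
    for _ in range(n_boot):
        # Draw patients, then pool every region-level difference they contribute.
        sample = rng.choices(patients, k=len(patients))
        pooled = [d for p in sample for d in diffs_by_patient[p]]
        stats.append(hodges_lehmann(pooled))
    stats.sort()
    idx = min(len(stats) - 1, int(level * len(stats)))
    return stats[idx]


# Made-up paired differences in percentage points, keyed by hypothetical patient IDs.
# Positive values mean the model's absolute TBSA error was larger than the physicians'.
diffs_by_patient = {
    "p1": [0.5, 0.0],
    "p2": [0.25],
    "p3": [1.0, -0.5],
    "p4": [0.0],
    "p5": [0.75],
    "p6": [-0.25, 0.5],
}

MARGIN = 3.0  # pre-specified non-inferiority margin from the abstract
point = hodges_lehmann([d for ds in diffs_by_patient.values() for d in ds])
upper = cluster_bootstrap_upper_bound(diffs_by_patient)
non_inferior = upper < MARGIN  # non-inferior if the upper bound stays below the margin
print(f"HL estimate={point:.2f}, upper bound={upper:.2f}, non-inferior={non_inferior}")
```

Resampling at the patient level rather than the region level keeps correlated regions from the same patient together, which is the reason a cluster bootstrap is used when one patient can contribute several burn regions.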

Of 413 screened presentations, 52 patients were enrolled, yielding 64 analyzable burn region-cases (35 pediatric, 29 adult). The model’s mean absolute TBSA error versus the panel was 1.40 percentage points (median 1.00); 87.5% of cases were within ± 3 percentage points and 98.4% within ± 5. The physician median had a mean absolute error of 0.89 percentage points (median 0.75). The paired non-inferiority analysis met the pre-specified criterion (Hodges–Lehmann median Δ = 0.25; one-sided 95% upper bound = 0.50), indicating the model was non-inferior to physicians for TBSA estimation. In contrast, depth agreement versus the panel was slight for the model (quadratic-weighted kappa 0.14), with systematic underestimation of deeper burns, while physician consensus showed substantially higher agreement (quadratic-weighted kappa 0.65).
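For readers unfamiliar with the depth-agreement metric, the following is a minimal sketch of quadratic-weighted kappa for ordinal ratings. It assumes depth classes are coded as integers from 0 (most superficial) upward; the three-class coding and the toy ratings are illustrative assumptions, since the abstract does not list the depth categories or any case-level data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels 0..n_classes-1.

    Undefined (division by zero) if both raters use one and the same single class.
    """
    n = len(rater_a)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared distance between classes.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings (hypothetical): 0 = superficial, 1 = partial thickness, 2 = full thickness.
panel = [0, 1, 2, 2, 1, 0, 2, 1]
close_rater = [0, 1, 2, 1, 1, 0, 2, 1]   # one case rated one class too shallow
under_rater = [0, 0, 1, 1, 0, 0, 1, 0]   # every deeper burn rated one class too shallow

kappa_close = quadratic_weighted_kappa(panel, close_rater, 3)
kappa_under = quadratic_weighted_kappa(panel, under_rater, 3)
print(f"close rater: {kappa_close:.2f}, systematic underestimation: {kappa_under:.2f}")
```

In this toy example the rater who consistently shifts deeper burns one class shallower scores lower than the rater with a single miss, which mirrors the direction of the pattern reported for the model, though not its magnitude.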

In this prospective emergency department evaluation, a general-purpose multimodal model achieved non-inferior performance to emergency physicians for region-level TBSA estimation but performed substantially worse for burn depth classification. These findings support a narrowly defined adjunct role for TBSA estimation, while depth-dependent decisions should remain clinician-led and require further method development and external validation.

The online version contains supplementary material available at 10.1186/s13049-026-01577-6.

## Linked entities

- **Diseases:** burn (MONDO:0043519)

## Full-text entities

- **Diseases:** burn (MESH:D002056)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12969848/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12969848/full.md

---
Source: https://tomesphere.com/paper/PMC12969848