# Comparative evaluation of multimodal large language models for diagnostic accuracy in pediatric electrocardiography: a prospective comparative diagnostic accuracy study

**Authors:** Uğur Saraç, Ayşe Büşra Paydaş, Mustafa Gençeli, Talha Üstüntaş, Mehtap Yücel, Abdülkerim Çokbiçer, Fatih Şap, Tamer Baysal, Mehmet Burhan Oflaz

PMC · DOI: 10.1007/s00431-026-06874-x · 2026-03-24

## TL;DR

This study compared three multimodal LLMs (ChatGPT, Gemini, and Microsoft Copilot) on pediatric ECG interpretation and found limited diagnostic accuracy (AUC 0.550 to 0.623), suggesting they may at most serve as adjunctive screening aids under clinician oversight.

## Contribution

First head-to-head comparison of multimodal LLMs in pediatric ECG interpretation using likelihood ratios as primary outcomes.

## Key findings

- All three models showed limited rule-in utility: for clinically significant abnormalities, +LR was 2.05 (ChatGPT), 1.26 (Gemini), and 1.21 (Copilot).
- Gemini achieved 100% sensitivity for emergency arrhythmias (n = 22) but with a specificity of only 30.2%, indicating overcalling.
- No model achieved clinically meaningful diagnostic accuracy for standalone use.
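
To make the rule-in and rule-out language concrete, a likelihood ratio can be applied to a pre-test probability through odds. The sketch below is illustrative and not part of the study: it assumes the pre-test probability equals the 54.5% prevalence of clinically significant abnormalities reported in the abstract, and `post_test_probability` is a hypothetical helper name.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Assumption: pre-test probability = study prevalence (54.5%, from the abstract).
PREVALENCE = 0.545

# Likelihood ratios for the clinically significant endpoint, from the abstract.
POSITIVE_LR = {"ChatGPT": 2.05, "Gemini": 1.26, "Copilot": 1.21}
NEGATIVE_LR = {"ChatGPT": 0.68, "Gemini": 0.55, "Copilot": 0.81}

for model in POSITIVE_LR:
    after_positive = post_test_probability(PREVALENCE, POSITIVE_LR[model])
    after_negative = post_test_probability(PREVALENCE, NEGATIVE_LR[model])
    print(f"{model}: {after_positive:.1%} after a positive call, "
          f"{after_negative:.1%} after a negative call")
```

Under that assumption, a positive call moves the probability from 54.5% to about 71% for ChatGPT and to about 60% for Gemini and Copilot, which is the sense in which rule-in utility is limited.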

## Abstract

We evaluated three multimodal LLMs, ChatGPT (GPT-5.2), Gemini 3, and Microsoft Copilot, in pediatric ECG interpretation, focusing on clinically significant abnormalities and emergency arrhythmias with likelihood ratios as primary outcome measures. This prospective comparative diagnostic accuracy study (STARD/STARD-AI) included 264 pediatric patients with 12-lead ECGs (November 2024–November 2025). De-identified images were submitted via a standardized zero-shot prompt. Three blinded pediatric cardiologists established the reference diagnosis by majority-vote consensus. Cases were classified as Tier 1 (normal), Tier 2 (abnormal, non-urgent), or Tier 3 (urgent). Two binary endpoints were assessed: clinically significant abnormality (Tier 2 + 3 vs Tier 1) and emergency abnormality (Tier 3 vs Tier 1 + 2). Clinically significant abnormalities were present in 54.5% of patients. AUC values ranged from 0.550 to 0.623, reflecting modest discrimination. For the clinically significant endpoint, +LR values were 2.05 (ChatGPT), 1.26 (Gemini), and 1.21 (Copilot); −LR values were 0.68, 0.55, and 0.81, indicating limited rule-in and insufficient rule-out utility. For the emergency endpoint, Gemini achieved 100% sensitivity (95% CI = 85.1–100.0) with −LR 0.07 (95% CI = 0.00–1.12) in a small subgroup (n = 22); however, specificity of 30.2% and +LR of 1.40 indicate overcalling rather than diagnostic precision. No model achieved clinically meaningful rule-in utility for either endpoint.
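
The reference standard and the two binary endpoints described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the three cardiologists' votes are taken over tier labels (1, 2, 3), and it returns `None` for a three-way split because the abstract does not say how such cases were resolved. All function names are hypothetical.

```python
from collections import Counter


def majority_vote(ratings):
    """Return the tier chosen by a strict majority of raters, else None.

    Assumption: how three-way disagreements were handled is not stated in
    the abstract, so they are left unresolved here.
    """
    tier, count = Counter(ratings).most_common(1)[0]
    return tier if count > len(ratings) / 2 else None


def is_clinically_significant(tier):
    """Endpoint 1: Tier 2 + 3 (abnormal) versus Tier 1 (normal)."""
    return tier in (2, 3)


def is_emergency(tier):
    """Endpoint 2: Tier 3 (urgent) versus Tier 1 + 2."""
    return tier == 3


# Hypothetical ratings from three raters for three made-up cases.
for ratings in ([1, 1, 2], [2, 3, 3], [1, 2, 3]):
    tier = majority_vote(ratings)
    if tier is None:
        print(ratings, "-> no majority")
    else:
        print(ratings, "-> tier", tier,
              "| significant:", is_clinically_significant(tier),
              "| emergency:", is_emergency(tier))
```

Note that a Tier 2 case counts as positive for the first endpoint and negative for the second, so the same ECG can contribute to a true positive on one endpoint and a false positive on the other.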

Conclusions: Current multimodal LLMs showed limited diagnostic utility in pediatric ECG interpretation, with +LR values near 1.0 across both endpoints. Standalone deployment is not supported; these tools may at most serve as adjunctive screening aids under clinician oversight.

What is Known:

• Deep learning algorithms trained on large ECG datasets perform well in adult populations, but evidence in pediatric ECG interpretation is limited.

• General-purpose LLMs show variable accuracy in medical examinations; reliability in subspecialty domains such as pediatric cardiology remains unproven.

What is New:

• This is the first head-to-head comparative diagnostic accuracy study of multimodal LLMs in pediatric ECG evaluation, using likelihood ratios as primary outcome measures.

• All three LLMs showed limited rule-in utility (+LR near 1.0); Gemini achieved potentially meaningful rule-out performance for emergency arrhythmias (−LR = 0.07), but with wide confidence intervals reflecting the small emergency subgroup (n = 22).

• Gemini’s 100% sensitivity in the emergency subgroup reflects overcalling (specificity 30.2%) consistent with a triage/screening behavior rather than diagnostic precision.
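
The emergency-endpoint figures for Gemini show how 100% sensitivity and a weak +LR can coexist, and why the −LR interval is so wide. The sketch below rests on stated assumptions: the 2×2 counts are back-calculated from the abstract (264 patients, 22 emergencies, 100% sensitivity, 30.2% specificity) and are not reported there, and the 0.5 continuity correction and log-scale interval are assumed methods that happen to reproduce the reported values. Function names are hypothetical.

```python
import math


def likelihood_ratios(tp, fn, fp, tn, correction=0.0):
    """Return (+LR, -LR) from 2x2 counts, adding `correction` to every cell."""
    tp, fn, fp, tn = (cell + correction for cell in (tp, fn, fp, tn))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity / (1.0 - specificity), (1.0 - sensitivity) / specificity


def negative_lr_ci(tp, fn, fp, tn, correction=0.5, z=1.96):
    """Approximate 95% CI for -LR on the log scale (assumed method).

    A non-zero correction is required when fn == 0, otherwise 1 / fn fails.
    """
    tp, fn, fp, tn = (cell + correction for cell in (tp, fn, fp, tn))
    neg_lr = (fn / (tp + fn)) / (tn / (tn + fp))
    se_log = math.sqrt(1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (tn + fp))
    return neg_lr * math.exp(-z * se_log), neg_lr * math.exp(z * se_log)


# Assumption: counts reconstructed from the abstract, not reported by the authors.
# 22 emergencies, all flagged; 242 non-emergencies, 73 correctly cleared.
COUNTS = dict(tp=22, fn=0, fp=169, tn=73)

raw = likelihood_ratios(**COUNTS)
corrected = likelihood_ratios(**COUNTS, correction=0.5)
ci_low, ci_high = negative_lr_ci(**COUNTS)

print(f"uncorrected: +LR {raw[0]:.2f}, -LR {raw[1]:.2f}")
print(f"corrected:   +LR {corrected[0]:.2f}, -LR {corrected[1]:.2f}")
print(f"-LR 95% CI:  {ci_low:.2f} to {ci_high:.2f}")
```

With zero false negatives the standard error is dominated by the `1 / fn` term, so the upper bound of the −LR interval crosses 1.0 even though the point estimate is small.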

The online version contains supplementary material available at 10.1007/s00431-026-06874-x.

## Full-text entities

- **Diseases:** long QT syndrome (MESH:D008133), Right bundle branch block (MESH:D002037), junctional reciprocating tachycardia (MESH:D054139), hyperkalemia (MESH:D006947), ECG abnormalities (MESH:D000014), Supraventricular tachycardia (MESH:D013617), Atrioventricular (MESH:D054537), hallucinations (MESH:D006212), Wolff-Parkinson-White (MESH:D014927), ectopic atrial tachycardia (MESH:D013612), ventricular dysfunction (MESH:D018754), aortic stenosis (MESH:D001024), left ventricular dysfunction (MESH:D018487), LLMs (MESH:D007806), arrhythmia (MESH:D001145), ventricular tachycardia (MESH:D017180), Congenital heart disease (MESH:D006330), Left ventricular hypertrophy (MESH:D017379), sinus bradycardia (MESH:D012804), sinus tachycardia (MESH:D013616), Premature atrial contraction (MESH:D018880), Atrial septal defect (MESH:D006344), atrial fibrillation (MESH:D001281), hypertrophic cardiomyopathy (MESH:D002312), Sinus arrhythmia (MESH:D001146), ventricular hypertrophy (MESH:D024741), Premature ventricular contraction (MESH:D018879)
- **Chemicals:** Gemini (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13009043/full.md

---
Source: https://tomesphere.com/paper/PMC13009043