Comparative evaluation of multimodal large language models for diagnostic accuracy in pediatric electrocardiography: a prospective comparative diagnostic accuracy study
Uğur Saraç, Ayşe Büşra Paydaş, Mustafa Gençeli, Talha Üstüntaş, Mehtap Yücel, Abdülkerim Çokbiçer, Fatih Şap, Tamer Baysal, Mehmet Burhan Oflaz

TL;DR
This study compared three multimodal LLMs (ChatGPT, Gemini, and Microsoft Copilot) in interpreting pediatric ECGs and found limited diagnostic accuracy, suggesting they should be used only as screening tools under clinician oversight.
Contribution
First head-to-head comparison of multimodal LLMs in pediatric ECG interpretation using likelihood ratios as primary outcomes.
Findings
All three models showed limited rule-in utility with +LR values near 1.0.
Gemini achieved 100% sensitivity for emergency arrhythmias but with low specificity, indicating overcalling.
No model achieved clinically meaningful diagnostic accuracy for standalone use.
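To illustrate why a +LR near 1.0 offers little rule-in value, the sketch below converts a pre-test probability into a post-test probability using the standard odds form of Bayes' theorem. The function name and the example likelihood ratios (1.0, 1.2, 10.0) are illustrative assumptions, not results from the study. The 54.5% pre-test probability is the prevalence of clinically significant abnormality reported in the abstract.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Return the post-test probability given a pre-test probability and an LR.

    Uses the odds form of Bayes' theorem: post_odds = pre_odds * LR.
    """
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    prevalence = 0.545  # prevalence reported in the abstract
    # Hypothetical LR values chosen only to show the effect of LR magnitude.
    for lr in (1.0, 1.2, 10.0):
        prob = post_test_probability(prevalence, lr)
        print(f"+LR = {lr:>4}: post-test probability = {prob:.3f}")
```

With an LR of exactly 1.0 the post-test probability equals the pre-test probability, so a positive model output would not change the clinician's estimate at all.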
Abstract
We evaluated three multimodal LLMs, ChatGPT (GPT-5.2), Gemini 3, and Microsoft Copilot, in pediatric ECG interpretation, focusing on clinically significant abnormalities and emergency arrhythmias, with likelihood ratios as the primary outcome measures. This prospective comparative diagnostic accuracy study (STARD/STARD-AI) included 264 pediatric patients with 12-lead ECGs (November 2024–November 2025). De-identified images were submitted via a standardized zero-shot prompt. Three blinded pediatric cardiologists established the reference diagnosis by majority-vote consensus. Cases were classified as Tier 1 (normal), Tier 2 (abnormal, non-urgent), or Tier 3 (urgent). Two binary endpoints were assessed: clinically significant abnormality (Tier 2 + 3 vs Tier 1) and emergency abnormality (Tier 3 vs Tier 1 + 2). Clinically significant abnormalities were present in 54.5% of patients. AUC values…
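The two binary endpoints and the likelihood ratios described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the helper names (`to_binary`, `diagnostic_metrics`) and the eight toy tier labels are hypothetical, and it assumes each model output has already been mapped to a tier.

```python
from typing import Dict, Iterable, List, Sequence, Set


def to_binary(tiers: Iterable[int], positive_tiers: Set[int]) -> List[bool]:
    """Collapse tier labels (1, 2, 3) into a binary endpoint."""
    return [t in positive_tiers for t in tiers]


def diagnostic_metrics(reference: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, and likelihood ratios from paired binary labels."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = sum(r and p for r, p in zip(reference, predicted))
    fn = sum(r and not p for r, p in zip(reference, predicted))
    fp = sum(p and not r for r, p in zip(reference, predicted))
    tn = sum(not r and not p for r, p in zip(reference, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # LR+ = sens / (1 - spec); LR- = (1 - sens) / spec. Guard the zero denominators.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {"sensitivity": sensitivity, "specificity": specificity,
            "lr_pos": lr_pos, "lr_neg": lr_neg}


if __name__ == "__main__":
    # Hypothetical toy data: consensus tiers vs. one model's tiers.
    reference_tiers = [1, 1, 2, 2, 3, 3, 1, 2]
    model_tiers = [1, 2, 2, 1, 3, 2, 1, 3]

    endpoints = {
        "clinically significant (Tier 2+3 vs 1)": {2, 3},
        "emergency (Tier 3 vs 1+2)": {3},
    }
    for name, positive in endpoints.items():
        metrics = diagnostic_metrics(to_binary(reference_tiers, positive),
                                     to_binary(model_tiers, positive))
        print(name, {k: round(v, 2) for k, v in metrics.items()})
```

The same tier labels feed both endpoints; only the set of tiers counted as positive changes.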
Taxonomy
Topics: ECG Monitoring and Analysis · Cardiac electrophysiology and arrhythmias · Cardiac Arrhythmias and Treatments
