Reliability of Gemini 2.5 Pro, ChatGPT 4.1, DeepSeek V3, and Claude Opus 4 in generating standardized CMR protocols

Răzvan-Andrei Licu; Giuseppe Muscogiuri; Davide Casartelli; Anca Bacârea; Marian Pop; Andra-Maria Licu; Daniele Sferratore; Alessandro Caruso; Marianna Mirchuk; Piotr Tarkowski; Jakub Byczkowski; Sandro Sironi

PMC · DOI: 10.1186/s41747-025-00671-1 · January 26, 2026

Open Access

TL;DR

This study evaluates how well four large language models (LLMs) generate standardized cardiovascular magnetic resonance (CMR) protocols, finding moderate to substantial agreement with a guideline-based expert reference standard.

Contribution

The study introduces a novel evaluation of LLMs for generating pathology-adapted CMR protocols under SCMR guidelines.

Findings

01. Gemini 2.5 Pro achieved the highest concordance with SCMR guidelines at 71.5%.

02. LLMs showed substantial agreement for mandatory CMR sequences (Fleiss κ ∈ [0.69, 0.74]).

03. Automation of CMR protocols could improve access to advanced cardiac diagnostics in primary healthcare.

Abstract

Artificial intelligence (AI) and large language models (LLMs) are increasingly integrated into radiology, offering new possibilities for advanced imaging techniques, including cardiovascular magnetic resonance (CMR). This proof-of-concept study assessed four high-performing LLMs (Gemini 2.5 Pro, ChatGPT 4.1, DeepSeek V3, and Claude Opus 4) on their ability to generate CMR protocols for 140 hypothetical cardiac cases. AI-generated protocols were compared against a reference standard established by a consensus between two experienced cardiovascular radiologists, following the Society for Cardiovascular Magnetic Resonance (SCMR) recommendations. Descriptive statistics were used to quantify the concordance of LLM-generated sequences with the SCMR guidelines. Statistical agreement was measured using Cohen and Fleiss κ statistics. Gemini 2.5 Pro achieved the highest concordance, aligning with…
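As an illustration of the agreement statistics named in the abstract, the following is a minimal sketch of Cohen's κ (two raters) and Fleiss' κ (several raters) in plain Python. It is not the authors' analysis code. The function names, the binary "sequence included / not included" coding, and the toy ratings are assumptions made for the example only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels all raters gave to item i."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    label_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    if p_e == 1.0:
        return 1.0  # only one label was ever used
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = sequence included, 0 = not included.
    reference = [1, 1, 0, 0]
    model = [1, 0, 0, 0]
    print(cohen_kappa(reference, model))  # 0.5

    # Three hypothetical cases, each rated by four hypothetical models.
    four_models = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
    print(fleiss_kappa(four_models))  # 5/9, roughly 0.556
```

In a setup like the one described, Cohen's κ would compare one model against the reference standard, while Fleiss' κ would summarize agreement across all four models at once. Established libraries offer equivalent functions and are preferable for real analyses.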

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Cardiac Imaging and Diagnostics · COVID-19 diagnosis using AI