# AI in the Hot Seat: Head-to-Head Comparison of Large Language Models and Cardiologists in Emergency Scenarios

**Authors:** Vedat Cicek, Lili Zhao, Yalcin Tur, Ahmet Oz, Sahhan Kilic, Gorkem Durak, Faysal Saylik, Mert Ilker Hayiroglu, Tufan Cinar, Ulas Bagci

PMC · DOI: 10.3390/medsci14010033 · Medical Sciences · 2026-01-08

## TL;DR

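This study compared seven large language models with five early-career interventional cardiologists on twelve simulated inferior myocardial infarction scenarios. ChatGPT scored higher than the physician group, Claude and DeepSeek showed no significant difference, and the remaining four models scored significantly lower.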

## Contribution

The study presents the first head-to-head comparison of LLMs and interventional cardiologists in managing simulated cardiac emergencies, with standardized, anonymized responses scored by experienced interventional cardiologists.

## Key findings

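- ChatGPT scored higher than the early-career cardiologist group in simulated cardiac emergency scenarios (87.4 vs. 80.68; difference 6.69, 95% CI 0.00–13.37; p < 0.05).
- LLM performance varied widely, from 87.4 (ChatGPT) down to 59.0 (Gemini); Gemini, LLAMA, Qwen, and Bing performed significantly worse than the physicians, while Claude and DeepSeek showed no significant difference.
- Some LLMs could serve as supplementary decision-support tools in interventional cardiology, a conclusion the authors limit to simulated conditions.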

## Abstract

Background: The clinical applicability of large language models (LLMs) in high-stakes cardiac emergencies remains unexplored. This study evaluated how well advanced LLMs perform in managing complex catheterization laboratory (Cath lab) scenarios and compared their performance with that of interventional cardiologists. Methods and Results: A cross-sectional study was conducted from 20 June to 2 December 2024. Twelve challenging inferior myocardial infarction scenarios were presented to seven LLMs (ChatGPT, Gemini, LLAMA, Qwen, Bing, Claude, DeepSeek) and five early-career interventional cardiologists. Responses were standardized, anonymized, and evaluated by thirty experienced interventional cardiologists. Performance comparisons were analyzed using a linear mixed-effects model with correlation and reliability statistics. Physicians had an average reference score of 80.68 (95% CI 76.3–85.0). Among LLMs, ChatGPT ranked highest (87.4, 95% CI 82.5–92.3), followed by Claude (80.8, 95% CI 75.7–85.9) and DeepSeek (78.7, 95% CI 72.9–84.6). LLAMA (73.7), Qwen (66.2), and Bing (64.3) ranked lower, while Gemini scored the lowest (59.0). ChatGPT scored higher than the early-career physician comparator group (difference 6.69, 95% CI 0.00–13.37; p < 0.05), whereas Gemini, LLAMA, Qwen, and Bing performed significantly worse; Claude and DeepSeek showed no significant difference. Conclusions: This expanded assessment reveals significant variability in LLM performance. In this simulated setting, ChatGPT demonstrated performance comparable to that of early-career interventional cardiologists. These results suggest that LLMs could serve as supplementary decision-support tools in interventional cardiology under simulated conditions.
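To make the reported ranking easier to scan, the following minimal sketch tabulates the point estimates quoted in the abstract and subtracts the physician reference score from each. It is only illustrative arithmetic: the names `llm_scores`, `PHYSICIAN_REFERENCE`, and `raw_differences` are hypothetical, and the raw subtraction is not the authors' analysis. The abstract's difference for ChatGPT (6.69) comes from a linear mixed-effects model, so it need not match the simple subtraction exactly, and statistical significance cannot be read off these raw differences.

```python
# Point estimates copied from the abstract; nothing here is re-estimated.
PHYSICIAN_REFERENCE = 80.68  # average reference score of the physician group

llm_scores = {
    "ChatGPT": 87.4,
    "Claude": 80.8,
    "DeepSeek": 78.7,
    "LLAMA": 73.7,
    "Qwen": 66.2,
    "Bing": 64.3,
    "Gemini": 59.0,
}


def raw_differences(scores, reference):
    """Return (model, score, score - reference) tuples, highest score first.

    This is a plain subtraction of reported means, an assumption made for
    illustration, not the mixed-effects contrast used in the paper.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score, round(score - reference, 2)) for name, score in ranked]


rows = raw_differences(llm_scores, PHYSICIAN_REFERENCE)
for name, score, diff in rows:
    # Signed difference makes it clear which models sit above the reference.
    print(f"{name:<9} {score:5.1f} {diff:+6.2f}")
```

Under these assumptions the raw difference for ChatGPT is 6.72, close to but not identical with the model-based 6.69 reported in the abstract.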

## Linked entities

- **Diseases:** inferior myocardial infarction (MONDO:0006803)

## Full-text entities

- **Diseases:** cardiac emergencies (MESH:D006331), inferior myocardial infarction (MESH:D056989)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12821637/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12821637/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12821637/full.md

---
Source: https://tomesphere.com/paper/PMC12821637