# Comparative Performance of Multimodal and Unimodal Large Language Models Versus Multicenter Human Clinical Experts in Aortic Dissection Management

**Authors:** Evren Ekingen, Mete Ucdal

PMC · DOI: 10.3390/diagnostics16020323 · Diagnostics · 2026-01-19

## TL;DR

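This study compares a multimodal LLM system, a unimodal LLM (ChatGPT-5.2), and nine physicians on 25 multiple-choice questions about aortic dissection. It finds no statistically significant accuracy differences between the AI models and the physician groups (all p > 0.05).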

## Contribution

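The study provides a head-to-head comparison of a combined multimodal system (GPT-4V + Med-PaLM 2 + BioGPT) and a unimodal model (ChatGPT-5.2) against radiologists, emergency medicine specialists, and cardiovascular surgeons from five centers, using aortic dissection questions that span diagnosis, treatment, and complication management.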

## Key findings

- The MLLM achieved 92.0% overall accuracy, excelling in diagnosis (100%) but scoring lower in treatment (88.9%) and complication management (87.5%).
- ChatGPT-5.2 (Unimodal) had 96.0% overall accuracy, outperforming the MLLM in treatment and complication management.
- Human physicians showed high accuracy: cardiovascular surgeons reached 96.0% (equal to ChatGPT-5.2), radiologists 92.0% (equal to the MLLM), and emergency medicine specialists 89.3%.
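The MLLM figures can be checked for internal consistency using the per-domain question counts and accuracies given in the abstract. The sketch below is a reader-side sanity check, not the authors' analysis; the helper name `implied_correct` is hypothetical, and it assumes each percentage is a rounded share of correct answers within its domain.

```python
# Question counts per domain, as reported in the abstract.
DOMAIN_SIZES = {"diagnosis": 8, "treatment": 9, "complication management": 8}

# MLLM per-domain accuracy in percent, as reported in the abstract.
MLLM_DOMAIN_PCT = {"diagnosis": 100.0, "treatment": 88.9, "complication management": 87.5}


def implied_correct(pct: float, n_questions: int) -> int:
    """Hypothetical helper: number of correct answers implied by a rounded percentage."""
    return round(pct / 100 * n_questions)


# Correct answers implied for each domain.
mllm_correct = {
    domain: implied_correct(MLLM_DOMAIN_PCT[domain], n)
    for domain, n in DOMAIN_SIZES.items()
}

total_questions = sum(DOMAIN_SIZES.values())
mllm_total_correct = sum(mllm_correct.values())
mllm_overall_pct = 100 * mllm_total_correct / total_questions

print(mllm_correct)
print(f"{mllm_total_correct}/{total_questions} = {mllm_overall_pct:.1f}%")
```

Under that assumption, the domain-level counts sum to the overall accuracy reported for the MLLM, so the three domain figures and the headline figure agree with each other.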

## Abstract

Background: Multimodal large language models (MLLMs) integrating multiple AI systems and unimodal large language models (LLMs) represent distinct approaches to clinical decision support. Their comparative performance against human clinical experts in complex cardiovascular emergencies remains inadequately characterized.

Objective: To compare the performance of a combined MLLM system (GPT-4V + Med-PaLM 2 + BioGPT), a unimodal LLM (ChatGPT-5.2), and human physicians from multiple centers (radiologists, emergency medicine specialists, cardiovascular surgeons) on aortic dissection clinical questions across diagnosis, treatment, and complication management domains.

Methods: This multicenter cross-sectional study was conducted across five tertiary care centers in Turkey (Elazığ, Ankara, Antalya). A total of 25 validated multiple-choice questions were categorized into three domains: diagnosis (n = 8), treatment (n = 9), and complication management (n = 8). Questions were administered to the MLLM, ChatGPT-5.2 (Unimodal), and nine physicians from five centers: radiologists (n = 3), emergency medicine specialists (n = 3), and cardiovascular surgeons (n = 3). Statistical comparisons utilized chi-square tests.

Results: Overall accuracy was 92.0% for the MLLM and 96.0% for ChatGPT-5.2 (Unimodal). Among human physicians, cardiovascular surgeons achieved 96.0%, radiologists 92.0%, and emergency medicine specialists 89.3%. The MLLM excelled in diagnosis (100%) but showed lower performance in treatment (88.9%) and complication management (87.5%). No significant differences were observed between AI models and human physician groups (all p > 0.05).

Conclusions: Both the MLLM and unimodal ChatGPT-5.2 demonstrated performance within the range of human clinical experts in this controlled assessment of aortic dissection scenarios, though definitive conclusions regarding equivalence require larger-scale validation. These findings support further investigation of complementary roles for different AI architectures in clinical decision support.
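To illustrate the kind of comparison the Methods describe, the sketch below runs a Pearson chi-square test on 2x2 tables of correct versus incorrect answers. It rests on several assumptions that the abstract does not confirm: physician accuracies are treated as pooled over 3 physicians x 25 questions (75 responses per group, which reproduces 96.0%, 92.0%, and 89.3% as 72/75, 69/75, and 67/75), no continuity correction is applied, and the function name `chi_square_2x2` is hypothetical. The authors' exact procedure may differ, and with expected cell counts this small an exact test is often preferred.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square (1 df, no continuity correction) for two accuracy proportions."""
    a, b = correct_a, total_a - correct_a  # group A: correct, incorrect
    c, d = correct_b, total_b - correct_b  # group B: correct, incorrect
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A marginal total is zero, so the groups cannot be distinguished.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# (correct, total) counts implied by the reported percentages.
# Physician totals of 75 are an assumption (3 physicians x 25 questions, pooled).
AI_MODELS = {"MLLM": (23, 25), "ChatGPT-5.2": (24, 25)}
PHYSICIANS = {
    "Cardiovascular surgeons": (72, 75),
    "Radiologists": (69, 75),
    "Emergency medicine": (67, 75),
}

results = {}
for ai_name, (ai_correct, ai_total) in AI_MODELS.items():
    for group, (g_correct, g_total) in PHYSICIANS.items():
        results[(ai_name, group)] = chi_square_2x2(ai_correct, ai_total, g_correct, g_total)

for (ai_name, group), (stat, p_value) in results.items():
    print(f"{ai_name} vs {group}: chi2 = {stat:.3f}, p = {p_value:.3f}")
```

Under these assumptions, every AI-versus-physician comparison yields p > 0.05, which is consistent with the abstract's statement that no significant differences were observed.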

## Full-text entities

- **Diseases:** Aortic Dissection (MESH:D000784), cardiovascular emergencies (MESH:D002318)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12839696/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12839696/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12839696/full.md

---
Source: https://tomesphere.com/paper/PMC12839696