# Performance of large language model in cross-specialty medical scenarios

**Authors:** Zhen Cui, Wuzheng Liu, Xuan Tian, Conglei You, Xiangyu Meng, Huijuan Zhang, Kangzi Gong, Xu Wang, Jun Wu

PMC · DOI: 10.1186/s12967-025-07577-x · 2025-12-22

## TL;DR

This study compares the diagnostic and therapeutic accuracy of three large language models across 12 medical specialties, finding that GPT-4o performs best for diagnosis but shows mixed results for treatment recommendations.

## Contribution

The study provides a systematic evaluation of LLMs' cross-specialty diagnostic and therapeutic performance using standardized clinical cases and physician assessments.

## Key findings

- GPT-4o showed superior diagnostic accuracy compared to GPT-3.5-Turbo and Claude-3-Sonnet across 12 medical specialties.
- GPT-4o's therapeutic recommendations were highly variable (median 10; IQR, 0–10): it scored significantly higher than GPT-3.5-Turbo (P = .0005) but not significantly higher than Claude-3-Sonnet (P = .45).
- LLMs demonstrated high consistency in diagnostic outputs but inconsistent therapeutic performance, limiting clinical adoption.

## Abstract

Large language models (LLMs) demonstrate transformative potential in healthcare, yet their diagnostic and therapeutic accuracy across medical specialties remains inadequately characterized.

This study aimed to compare the diagnostic and therapeutic capabilities of GPT-4o, GPT-3.5-Turbo, and Claude-3-Sonnet across 12 medical specialties using standardized clinical vignettes. Fifty PubMed-derived clinical cases from between 2007 and 2024 were assessed. Two board-certified physicians independently evaluated the LLM outputs, with a senior clinician adjudicating discrepancies. All LLMs received identical text-based case descriptions, with or without images, and generated free-text diagnostic and therapeutic recommendations for blinded, randomized evaluation.
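The two-rater scoring with senior adjudication described above can be pictured with a minimal sketch. The abstract does not say what counts as a discrepancy or how the adjudicator decides, so this sketch assumes that any mismatch between the two scores triggers adjudication and that scores lie on a 0 to 10 scale (inferred from the reported medians). The names `final_score` and `adjudicate` are hypothetical.

```python
from typing import Callable


def final_score(
    rater_a: int,
    rater_b: int,
    adjudicate: Callable[[int, int], int],
) -> int:
    """Return the agreed score, or the adjudicator's score when raters differ.

    Assumption: any mismatch counts as a discrepancy. The paper may use a
    different rule (for example, a tolerance band).
    """
    if rater_a == rater_b:
        return rater_a
    # Hypothetical callback standing in for the senior clinician's judgment.
    return adjudicate(rater_a, rater_b)


# Usage with made-up scores: the raters agree on the first case and
# disagree on the second, where the adjudicator settles on 7.
agreed = final_score(8, 8, adjudicate=lambda a, b: 0)
resolved = final_score(10, 4, adjudicate=lambda a, b: 7)
```

The adjudicator is passed in as a callback so that the resolution rule stays outside the function, since the source does not specify it.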

Among the three evaluated LLMs, GPT-4o demonstrated superior diagnostic accuracy (median 10; IQR, 7.5–10), outperforming Claude-3-Sonnet (median 8; IQR, 2.8–10; P = .02) and GPT-3.5-Turbo (median 4; IQR, 1–9.3; P < .0001). A narrow IQR and minimal variation (SD = 2.9; range = 5.0) reflected high consistency in diagnostic outputs across diverse medical fields. For therapeutic recommendations, GPT-4o (median 10, IQR 0–10) outperformed GPT-3.5-Turbo (median 0, IQR 0–6.3; P = .0005) but showed no significant advantage over Claude-3-Sonnet (median 5, IQR 0–10; P = .45).
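As a reading aid for the "median; IQR" summaries above, here is a minimal sketch of how such a summary can be computed from a list of per-case scores. The scores are invented for illustration and are not data from the study; the quartile method (`inclusive`) is also an assumption, since the abstract does not state which one the authors used.

```python
import statistics


def summarize(scores: list[int]) -> tuple[float, float, float]:
    """Return (median, Q1, Q3) for a list of per-case scores.

    Assumption: quartiles use the 'inclusive' method, which treats the
    data as the whole population; other methods can give different cut points.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q1, q3


# Hypothetical scores on a 0-10 scale (not taken from the paper).
example_scores = [0, 2, 4, 5, 8, 9, 10, 10, 10]
med, q1, q3 = summarize(example_scores)
print(f"median {med:g}; IQR, {q1:g}–{q3:g}")  # prints "median 8; IQR, 4–10"
```

A wide interval such as 0–10 means that at least a quarter of the cases scored at the bottom of the scale and at least a quarter at the top, which is why the abstract describes the therapeutic results as inconsistent even when the median is high.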

This study demonstrates that advanced LLMs, particularly GPT-4o, have significant potential to support clinical diagnostics, showing high accuracy and consistency across specialties. However, their inconsistent performance in generating therapeutic recommendations presents a major barrier to clinical adoption.

The online version contains supplementary material available at 10.1186/s12967-025-07577-x.

## Full-text entities

- **Diseases:** Rheumatology (MESH:D012216), artificial hallucination (MESH:D006212), Infectious Diseases (MESH:D003141), Oncology (MESH:D000072716), lesion (MESH:D009059), LLMs (MESH:D007806)
- **Chemicals:** Claude-3-Sonnet (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12903388/full.md

---
Source: https://tomesphere.com/paper/PMC12903388