# Performance of Physicians and AI Systems on Pulmonary Thromboembolism Questions

**Authors:** Evren Ekingen, Mete Ucdal

PMC · DOI: 10.7759/cureus.100476 · Cureus · 2025-12-31

## TL;DR

This study compared how well AI systems and doctors perform on questions about pulmonary thromboembolism, finding that some AI models matched or exceeded specialist physicians in accuracy.

## Contribution

The study introduces a direct comparison of AI and physician performance on a complex medical topic using a structured assessment with non-inferiority thresholds.

## Key findings

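- Claude 2 scored 80%, matching internal medicine and pulmonary specialists, and outperformed emergency medicine physicians (64.6%) by 15.4 percentage points (p<0.05).
- ChatGPT-4 and Med-PaLM each scored 72%, which is 8 percentage points below internal medicine and pulmonary specialists and therefore non-inferior within the pre-specified 10-percentage-point margin.
- All groups performed well on diagnostic questions but struggled with nuanced treatment and management scenarios; AI systems had particular difficulty with guideline-based edge cases and cancer-associated thromboembolism management.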

## Abstract

Background: AI systems are increasingly being evaluated for their potential role in medical decision-making. Pulmonary thromboembolism (PTE) represents an ideal test domain for evaluating AI clinical reasoning capabilities due to its high prevalence, significant mortality risk, and clinical complexity requiring integration of validated risk stratification tools, multiple imaging modalities, and nuanced treatment algorithms across diverse patient populations, including pregnancy, malignancy, and renal impairment. We compared the performance of large language models (LLMs) with specialist physicians on PTE knowledge assessment.

Methods: We administered 25 multiple-choice questions covering the diagnosis, treatment, complications, and management of PTE to 17 physicians (seven emergency medicine, five internal medicine, and five pulmonary specialists) and three AI systems: ChatGPT-4 (OpenAI, San Francisco, CA, USA), Claude 2 (Anthropic, San Francisco, CA, USA), and Google Med-PaLM (Google Research, Mountain View, CA, USA). Questions were categorized into four domains: diagnosis, treatment, complications, and management/ICU. We calculated overall accuracy and domain-specific performance. We applied a pre-specified non-inferiority margin of 10 percentage points, a threshold consistent with FDA guidance for medical device comparison studies and prior AI-physician trials, representing the maximum clinically acceptable performance gap that would still support practical utility in adjunctive clinical decision support while maintaining appropriate safety standards.
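The overall and domain-specific accuracy calculation described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `score`, the question IDs, the answer key, and the responses are all hypothetical toy values, and only the four domain labels come from the abstract.

```python
from collections import defaultdict


def score(responses, answer_key, domains):
    """Return (overall_accuracy, {domain: accuracy}) as fractions in [0, 1].

    All three arguments are dicts keyed by question ID (hypothetical layout).
    """
    correct_by_domain = defaultdict(int)
    total_by_domain = defaultdict(int)
    for question_id, correct_answer in answer_key.items():
        domain = domains[question_id]
        total_by_domain[domain] += 1
        # An unanswered question counts as incorrect.
        if responses.get(question_id) == correct_answer:
            correct_by_domain[domain] += 1
    overall = sum(correct_by_domain.values()) / len(answer_key)
    per_domain = {d: correct_by_domain[d] / n for d, n in total_by_domain.items()}
    return overall, per_domain


# Hypothetical four-question toy example (NOT the study's questions or answers).
toy_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
toy_domains = {
    "q1": "diagnosis",
    "q2": "treatment",
    "q3": "complications",
    "q4": "management/ICU",
}
toy_responses = {"q1": "A", "q2": "C", "q3": "B", "q4": "A"}

overall, per_domain = score(toy_responses, toy_key, toy_domains)
print(f"overall: {overall:.0%}")
for domain, accuracy in per_domain.items():
    print(f"{domain}: {accuracy:.0%}")
```

With 25 questions, each question is worth 4 percentage points of overall accuracy, so a reported 80% corresponds to 20 correct answers and 72% to 18.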

Results: Internal medicine and pulmonary specialists achieved the highest scores (80% each), matched by Claude 2 (80%). ChatGPT-4 and Med-PaLM scored 72% each, while emergency medicine specialists averaged 64.6%. Claude 2 significantly outperformed emergency medicine physicians (+15.4 percentage points, p<0.05). ChatGPT-4 and Med-PaLM demonstrated non-inferiority to internal medicine and pulmonary specialists (-8 percentage points, within the 10-percentage-point margin). All groups performed well on diagnostic questions but struggled with nuanced treatment and management scenarios. AI systems showed particular difficulty with guideline-based edge cases and cancer-associated thromboembolism management.
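The reported differences can be reproduced from the accuracies above. The sketch below assumes the margin is applied to the point estimates, as the wording of the abstract suggests; a formal non-inferiority test would usually compare a confidence bound against the margin, and the abstract does not report one. The names `difference_pp` and `within_margin` are illustrative, not taken from the paper.

```python
# Overall accuracies (percent) as reported in the Results paragraph.
reported_accuracy = {
    "Internal medicine": 80.0,
    "Pulmonary": 80.0,
    "Emergency medicine": 64.6,
    "Claude 2": 80.0,
    "ChatGPT-4": 72.0,
    "Med-PaLM": 72.0,
}

MARGIN_PP = 10.0  # pre-specified non-inferiority margin, in percentage points


def difference_pp(candidate, reference):
    """Candidate accuracy minus reference accuracy, in percentage points."""
    return round(reported_accuracy[candidate] - reported_accuracy[reference], 1)


def within_margin(candidate, reference, margin=MARGIN_PP):
    """Point-estimate check only (an assumption): the shortfall must not exceed the margin."""
    return difference_pp(candidate, reference) >= -margin


for model in ("Claude 2", "ChatGPT-4", "Med-PaLM"):
    for group in ("Internal medicine", "Pulmonary", "Emergency medicine"):
        diff = difference_pp(model, group)
        print(f"{model} vs {group}: {diff:+.1f} pp, within margin: {within_margin(model, group)}")
```

On these figures, Claude 2 is 15.4 points above emergency medicine physicians, and ChatGPT-4 and Med-PaLM are 8 points below the internal medicine and pulmonary groups, which is inside the 10-point margin.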

Conclusions: Advanced AI systems can achieve specialist-level performance on structured medical knowledge assessments. Claude 2 matched top specialists and exceeded emergency medicine performance, while other AI systems were non-inferior to domain experts. These findings support the potential utility of AI in medical education and clinical decision support while highlighting areas requiring further development.

## Linked entities

- **Diseases:** cancer (MONDO:0004992)

## Full-text entities

- **Diseases:** renal impairment (MESH:D007674), PTE (MESH:D011655), cancer (MESH:D009369), thromboembolism (MESH:D013923)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12856955/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12856955/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12856955/full.md

---
Source: https://tomesphere.com/paper/PMC12856955