# Performance of Microsoft Copilot in the Diagnostic Process of Pulmonary Embolism

**Authors:** Banu Arslan, Mehmet Necmeddin Sutasir, Ertugrul Altinbilek

PMC · DOI: 10.5811/westjem.24995 · Western Journal of Emergency Medicine · 2025-07-13

## TL;DR

This study evaluates Microsoft Copilot's ability to support the diagnosis of pulmonary embolism from clinical vignette data and finds that it discriminates risk better than the Wells score.

## Contribution

The study evaluates Microsoft Copilot, used for its free GPT-4 integration, in the diagnostic process of pulmonary embolism and compares its risk assessment with the Wells score.

## Key findings

- Copilot correctly included pulmonary embolism in the top 10 differential diagnoses in 94.3% of cases.
- Copilot showed better discriminatory power than the Wells score in risk assessment of pulmonary embolism (area under the curve 0.713 vs 0.583).
- Copilot had higher sensitivity and specificity in risk categorization compared to the Wells score.

## Abstract

Patients with pulmonary embolism (PE) often present with non-specific signs and symptoms mimicking other conditions and complicating diagnosis. In this study we aimed to evaluate the performance of an artificial-intelligence tool, Microsoft Copilot, in the diagnostic process of PE, using clinical data including demographics, complaints, and vital signs.

We conducted this study using 140 clinical vignettes, including 70 patients with and 70 patients without PE. The vignettes were derived from case reports published within the last 10 years. We used Copilot for its free GPT-4 integration to analyze clinical data and answer two questions after each vignette. We assessed Copilot’s ability to identify PE within the top 10 differential diagnoses, and we compared its ability to predict the risk of PE with the Wells score as applied by two independent investigators.
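As an illustration of the comparator, the sketch below computes a Wells score for PE and maps it to a three-tier risk category. The abstract does not state which Wells variant or cutoffs the investigators applied, so the point values and the low (<2), intermediate (2 to 6), and high (>6) thresholds are the commonly cited three-tier model and should be read as an assumption. The names `WELLS_PE_POINTS`, `wells_score`, and `wells_risk_category` are hypothetical.

```python
# Commonly cited Wells criteria for PE and their point values.
# Assumption: the abstract does not confirm this exact variant.
WELLS_PE_POINTS = {
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}


def wells_score(findings):
    """Sum the points for the criteria present in one vignette."""
    unknown = set(findings) - set(WELLS_PE_POINTS)
    if unknown:
        raise ValueError(f"Unknown Wells criteria: {sorted(unknown)}")
    return sum(WELLS_PE_POINTS[name] for name in findings)


def wells_risk_category(score):
    """Map a score to the three-tier model (assumed cutoffs)."""
    if score < 2:
        return "low"
    if score <= 6:
        return "intermediate"
    return "high"


# Hypothetical vignette: tachycardia and hemoptysis only.
example_findings = {"heart_rate_over_100", "hemoptysis"}
example_score = wells_score(example_findings)
print(example_score, wells_risk_category(example_score))
```

Collapsing "low" and "intermediate" into one group and testing it against "high" gives the binary split that the reported sensitivity and specificity refer to.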

Copilot correctly included PE in the differential diagnosis in 94.3% of cases by listing it within the top 10 conditions. Risk assessment by Copilot yielded significantly higher risk levels in patients with PE (P<.05). No statistically significant difference was found in the Wells scores between patients with PE and without PE (P>.05). Copilot demonstrated better discriminatory power than the Wells score in risk assessment of PE (area under the curve 0.713 vs 0.583), with statistical significance (P<.001 vs P=.091). Sensitivity, specificity, positive predictive value, and negative predictive value for discriminating between the combination of low- and intermediate- vs high-risk categories were 34%, 97.1%, 92.3%, and 59.6%, respectively.
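The four reported percentages follow from a single 2x2 table, which the sketch below makes explicit. The abstract does not give the underlying counts. The values used here (24 true positives, 46 false negatives, 2 false positives, 68 true negatives) are an assumption, back-calculated from the reported percentages and the 70/70 split of vignettes. The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Return standard 2x2 diagnostic accuracy measures as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # high-risk calls among PE cases
        "specificity": tn / (tn + fp),  # non-high-risk calls among non-PE cases
        "ppv": tp / (tp + fp),          # PE cases among high-risk calls
        "npv": tn / (tn + fn),          # non-PE cases among non-high-risk calls
    }


# Assumed counts, reconstructed from the reported percentages;
# they are not stated in the abstract.
metrics = diagnostic_metrics(tp=24, fn=46, fp=2, tn=68)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumed counts, the low sensitivity and high specificity indicate that a "high-risk" rating rarely occurs without PE but misses most PE cases, which is why the negative predictive value stays close to 60%.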

This study explores the potential of Copilot as a tool in clinical decision-making, demonstrating a high rate of correctly identifying PE and improved performance over the Wells score. However, further validation in larger populations and real-world settings is crucial to fully realize its potential.

## Linked entities

- **Diseases:** pulmonary embolism (MONDO:0005279)

## Full-text entities

- **Diseases:** PE (MESH:D011655)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12342421/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12342421/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12342421/full.md

---
Source: https://tomesphere.com/paper/PMC12342421