# Benchmarking Two Leading Large Language Models for Pulmonary Embolism Identification on CT Pulmonary Angiography

**Authors:** Nitin Chetla, Tamer Hage, Swapna Vaja, Andrew Bouras, Shivam Patel, Harshita Kacham, Vinisha Bonagiri, Nasif Zaman

PMC · DOI: 10.7759/cureus.92719 · Cureus · 2025-09-19

## TL;DR

This study compares GPT-4o and Gemini 2.0 at detecting pulmonary embolism on CT pulmonary angiography (CTPA) images and finds opposite strengths: GPT-4o is sensitive but not specific, while Gemini 2.0 is specific but not sensitive.

## Contribution

The paper benchmarks GPT-4o and Gemini 2.0 for pulmonary embolism detection on CTPA images from the RSNA PE Detection Challenge 2020, using simplified prompts modeled on the USMLE Step 1 examination format.

## Key findings

- GPT-4o had high sensitivity (81%) but low specificity (10%) for detecting pulmonary embolism.
- Gemini 2.0 showed high specificity (98%) but low sensitivity (14%) for the same task.
- The two models exhibited opposing diagnostic biases, GPT-4o toward sensitivity and Gemini 2.0 toward specificity, which highlights current limitations of LLMs for radiological diagnosis.

## Abstract

Introduction

Recent advances in large language models (LLMs) such as GPT-4 Omni (GPT-4o) (OpenAI, Inc., San Francisco, CA) and Gemini 2.0 (Google, Inc., Mountain View, CA) have enabled their application in medical image interpretation. This study evaluates the ability of these LLMs to detect pulmonary embolism (PE) on computed tomography pulmonary angiography (CTPA) images using simplified prompts modeled after the United States Medical Licensing Examination (USMLE) Step 1 examination format.

Methods

Digital Imaging and Communications in Medicine (DICOM) images from the Radiological Society of North America (RSNA) PE Detection Challenge 2020 were converted to Portable Network Graphics (PNG) format and analyzed using GPT-4o and Gemini 2.0. A total of 12,533 PE-positive and 11,835 PE-negative slices were evaluated using GPT-4o, while Gemini 2.0 analyzed 12,302 PE-positive and 12,063 PE-negative slices. Images were presented using application programming interface (API) prompts designed to elicit categorical responses. Performance metrics, including accuracy, precision, recall, and F1 score, were calculated for each model.
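The abstract does not say how the DICOM pixel data were mapped to 8-bit PNG values. A common approach, shown here only as a minimal sketch and not as the authors' pipeline, is to convert stored values to Hounsfield units with the DICOM rescale slope and intercept and then apply an intensity window. The function names, the default rescale values, and the window center and width below are all illustrative assumptions. Reading the DICOM file and writing the PNG would need an imaging library and are left out.

```python
def hu_to_uint8(raw, slope=1.0, intercept=-1024.0, center=100.0, width=700.0):
    """Map one stored CT pixel value to an 8-bit gray level.

    All defaults are assumptions for illustration; real values should come
    from each file's RescaleSlope / RescaleIntercept attributes and from a
    window chosen for the task.
    """
    hu = raw * slope + intercept          # stored value -> Hounsfield units
    lo = center - width / 2               # lower edge of the display window
    hi = center + width / 2               # upper edge of the display window
    clipped = min(max(hu, lo), hi)        # clamp values outside the window
    # Scale linearly to 0..255 and round half up.
    return int((clipped - lo) / (hi - lo) * 255 + 0.5)


def window_slice(rows, **window_args):
    """Apply hu_to_uint8 to a 2D slice given as a list of rows."""
    return [[hu_to_uint8(value, **window_args) for value in row] for row in rows]


# Toy 1x3 "slice": far below the window, at its center, far above it.
print(window_slice([[0, 1124, 4000]]))  # [[0, 128, 255]]
```

The chosen window controls how much contrast between opacified vessels and clot survives the conversion to 8 bits, so it is one preprocessing choice that could plausibly influence what a vision-capable model sees.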

Results

GPT-4o demonstrated high sensitivity but low specificity, correctly identifying 38/47 PE-positive cases (81%) but only 5/49 PE-negative cases (10%). Gemini 2.0 showed the opposite pattern, correctly identifying 50/51 PE-negative cases (98%) but only 7/49 PE-positive cases (14%). F1 scores reflected this divergence: GPT-4o scored 0.59 on the positive class versus 0.16 on the negative class, while Gemini 2.0 scored 0.70 on the negative class versus 0.25 on the positive class.
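The reported percentages and F1 scores can be re-derived from the counts in the Results paragraph. The sketch below is illustrative and is not the authors' code: the `Confusion` class and its method names are hypothetical, and the false-negative and false-positive counts are inferred as the remainder of each class after subtracting the correctly identified cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 counts for a binary PE / no-PE classifier (hypothetical helper)."""
    tp: int  # PE-positive cases called positive
    fn: int  # PE-positive cases called negative
    tn: int  # PE-negative cases called negative
    fp: int  # PE-negative cases called positive

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def f1_positive(self) -> float:
        # F1 with "PE present" as the target class: 2TP / (2TP + FP + FN)
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    def f1_negative(self) -> float:
        # F1 with "PE absent" as the target class: 2TN / (2TN + FN + FP)
        return 2 * self.tn / (2 * self.tn + self.fn + self.fp)


# Counts from the abstract; fn and fp are inferred, not reported directly.
gpt4o = Confusion(tp=38, fn=47 - 38, tn=5, fp=49 - 5)
gemini = Confusion(tp=7, fn=49 - 7, tn=50, fp=51 - 50)

for name, m in (("GPT-4o", gpt4o), ("Gemini 2.0", gemini)):
    print(
        f"{name}: sens={m.sensitivity():.2f} spec={m.specificity():.2f} "
        f"F1+={m.f1_positive():.2f} F1-={m.f1_negative():.2f}"
    )
# GPT-4o: sens=0.81 spec=0.10 F1+=0.59 F1-=0.16
# Gemini 2.0: sens=0.14 spec=0.98 F1+=0.25 F1-=0.70
```

Under these assumptions the computed values match every percentage and F1 score quoted above, which suggests that each "versus" pair compares the positive and negative classes within a single model.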

Conclusion

GPT-4o and Gemini 2.0 exhibited opposing diagnostic biases, GPT-4o favoring sensitivity and Gemini 2.0 favoring specificity, highlighting current limitations of LLMs in radiological diagnosis. While promising, these models require further refinement before integration into clinical workflows.

## Linked entities

- **Diseases:** pulmonary embolism (MONDO:0005279)

## Full-text entities

- **Diseases:** PE (MESH:D011655)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12535752/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12535752/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/PMC12535752/full.md

---
Source: https://tomesphere.com/paper/PMC12535752