# Limitations in Chest X-Ray Interpretation by Vision-Capable Large Language Models, Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o

**Authors:** Chih-Hsiung Chen, Chang-Wei Chen, Kuang-Yu Hsieh, Kuo-En Huang, Hsien-Yung Lai

PMC · DOI: 10.3390/diagnostics16030376 · Diagnostics · 2026-01-23

## TL;DR

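This study shows that vision-capable large language models can detect some chest X-ray findings from the image alone, particularly large, bilateral, or multiple lesions, but struggle with small lesions, laterality, and differential-diagnosis reasoning.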

## Contribution

The study evaluates four vision-capable LLMs (Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o) on image-only interpretation of 247 chest X-rays spanning 13 diagnostic categories, with no clinical context supplied.

## Key findings

- Gemini 1.5 Pro had the highest overall detection rate, followed by Gemini 1.0, GPT-4o, and GPT-4 Turbo.
- LLMs struggled with small lesions, laterality recognition, and differential diagnosis reasoning.
- Models showed higher sensitivity for large, bilateral, and multiple lesions.

## Abstract

Background/Objectives: Interpretation of chest X-rays (CXRs) requires accurate identification of lesion presence, diagnosis, location, size, and number to be considered complete. However, the effectiveness of large language models with vision capabilities (LLMs) in performing these tasks remains uncertain. This study aimed to evaluate the image-only interpretation performance of LLMs in the absence of clinical information.

Methods: A total of 247 CXRs covering 13 diagnostic categories, including pulmonary edema, cardiomegaly, lobar pneumonia, and other conditions, were evaluated using Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o. The text outputs generated by the LLMs were evaluated at two levels: (1) primary diagnosis accuracy across the 13 predefined diagnostic categories, and (2) identification of key imaging features described in the generated text. Primary diagnosis accuracy was assessed based on whether the model correctly identified the target diagnostic category and was classified as fully correct, partially correct, or incorrect according to predefined clinical criteria. Non-diagnostic imaging features, such as posteroanterior and anteroposterior (PA/AP) views, side markers, foreign bodies, and devices, were recorded and analyzed separately rather than being incorporated into the primary diagnostic scoring.

Results: When fully and partially correct responses were treated as successful detections, LLMs showed higher sensitivity for large, bilateral, or multiple lesions and for prominent devices, including acute pulmonary edema, lobar pneumonia, multiple malignancies, massive pleural effusions, and pacemakers, all of which demonstrated statistically significant differences across categories in chi-square analyses. Feature descriptions varied among models, especially in PA/AP views and side markers, though central lines were partially recognized. Across the entire dataset, Gemini 1.5 Pro achieved the highest overall detection rate, followed by Gemini 1.0, GPT-4o, and GPT-4 Turbo.

Conclusions: Although LLMs were able to identify certain diagnoses and key imaging features, their limitations in detecting small lesions, recognizing laterality, reasoning through differential diagnoses, and using domain-specific expressions indicate that CXR interpretation without textual cues still requires further improvement.
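The scoring scheme in the abstract (three grades per response, with fully and partially correct both counted as detections, then a chi-square comparison across categories) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the category labels, and the toy grades are all hypothetical, and the paper's actual counts are not reproduced here.

```python
from collections import Counter

# Grades mirror the three-level scoring described in the abstract.
FULL, PARTIAL, INCORRECT = "full", "partial", "incorrect"


def detection_rate(grades):
    """Share of responses graded fully or partially correct."""
    if not grades:
        raise ValueError("no grades to score")
    counts = Counter(grades)
    return (counts[FULL] + counts[PARTIAL]) / len(grades)


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table.

    Assumes every row and column total is non-zero, so no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical toy grades for two made-up categories (not data from the paper).
toy_grades = {
    "category_a": [FULL, PARTIAL, FULL, INCORRECT],
    "category_b": [INCORRECT, INCORRECT, PARTIAL, INCORRECT],
}

rates = {name: detection_rate(g) for name, g in toy_grades.items()}

# Contingency table: one row per category, columns are [detected, missed].
table = []
for grades in toy_grades.values():
    detected = sum(1 for g in grades if g in (FULL, PARTIAL))
    table.append([detected, len(grades) - detected])

stat, dof = chi_square(table)
print(rates, stat, dof)
```

Collapsing "fully" and "partially" correct into a single detected column is what makes the comparison a simple two-column table. A p-value would additionally require the chi-square distribution for the given degrees of freedom, which a statistics library such as SciPy provides.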

## Linked entities

- **Diseases:** pulmonary edema (MONDO:0006932)

## Full-text entities

- **Diseases:** pleural effusions (MESH:D010996), pneumonia (MESH:D011014), malignancies (MESH:D009369), acute pulmonary edema (MESH:D011654), cardiomegaly (MESH:D006332)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12897257/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12897257/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12897257/full.md

---
Source: https://tomesphere.com/paper/PMC12897257