# Evaluation of GPT-4 Accuracy in the Interpretation of Medical Imaging: Potential Benefits, Limitations, and the Future

**Authors:** Nikoloz Papiashvili, Christina Abshilava, Mohammad H Malik, Tinatin Dzindzibadze, Emeli J Anderson, Sopio Gagua, Vladimir Guruli, Kaveesha Amarasinghe, Nana Gonjilashvili, Irma Tchokhonelidze

PMC · DOI: 10.7759/cureus.87761 · Cureus · 2025-07-12

## TL;DR

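This study evaluates how well GPT-4o interprets X-ray, CT, and MRI images when given no clinical context. X-rays had higher odds of accurate interpretation than CT scans, while pelvic imaging and neoplastic conditions had lower odds than abdominal imaging and bleeding conditions, respectively.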

## Contribution

The study systematically evaluates GPT-4o's diagnostic accuracy on 377 radiological cases spanning X-ray, CT, and MRI, across various organ systems and disease types, with each response scored by three independent raters.

## Key findings

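- X-ray images had 2.21 times higher odds of accurate interpretation than CT scans (OR: 2.21; 95% CI: 1.33 - 3.69).
- Pelvic images had 6.25 times lower odds of accurate interpretation than abdominal images (OR: 0.16; 95% CI: 0.02 - 0.56).
- Neoplastic conditions had 2.7 times lower odds of accurate interpretation than bleeding conditions (OR: 0.37; 95% CI: 0.16 - 0.84).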

## Abstract

Introduction

The implementation of artificial intelligence (AI) in radiology as a medical decision support system has the potential to enhance diagnostic accuracy and improve patient outcomes. This retrospective study aimed to evaluate the diagnostic capabilities of GPT-4o in interpreting radiological imaging, specifically X-ray, CT, and MRI images, across various organ systems and disease types.

Methods

A total of 377 cases were collected and presented to GPT-4o with a standardized prompt and no clinical context. The responses were assessed by three independent raters using a five-point rating system.
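One way to picture the rating step is to collapse each case's three ratings into a single median score and then tally how often each score occurs. The snippet below is a minimal sketch of that idea, not the authors' analysis code: the function name `median_rating` and the example ratings are hypothetical, and the only details taken from the source are the three raters, the five-point scale, and the mention of median ratings.

```python
from collections import Counter
from statistics import median


def median_rating(ratings):
    """Return the median of one case's ratings (integers from 1 to 5)."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the five-point scale")
    return median(ratings)


# Hypothetical data: one tuple per case, one value per rater (NOT from the paper).
example_cases = [(5, 5, 4), (1, 2, 1), (4, 5, 5), (1, 1, 2), (3, 2, 4)]

# Median score per case, then a count of cases at each median score.
medians = [median_rating(case) for case in example_cases]
distribution = Counter(medians)

for score in sorted(distribution):
    print(f"median rating {score}: {distribution[score]} case(s)")
```

With an odd number of raters the median is always one of the observed ratings, so the tally stays on the original five-point scale.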

Results

X-ray images had 2.21 times higher odds, on average, of being interpreted accurately than CT scans (odds ratio (OR): 2.21; 95% confidence interval (CI): 1.33 - 3.69), while pelvic images had 6.25 times lower odds, on average, of being interpreted accurately than images of the abdomen (OR: 0.16; 95% CI: 0.02 - 0.56). Additionally, neoplastic conditions had 2.7 times lower odds, on average, of being interpreted accurately than bleeding conditions (OR: 0.37; 95% CI: 0.16 - 0.84).
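The "times lower" figures are the reciprocals of the reported odds ratios, and an odds ratio compares odds rather than probabilities. The following sketch reproduces that arithmetic from the values quoted above. The helper names are hypothetical, and treating a confidence interval that excludes 1 as the sign of a statistically significant association is a standard convention rather than something stated in the source.

```python
def times_lower(odds_ratio):
    """Express an odds ratio below 1 as 'N times lower odds' (its reciprocal)."""
    return 1.0 / odds_ratio


def excludes_one(ci_low, ci_high):
    """True when the confidence interval does not contain OR = 1 (no effect)."""
    return ci_high < 1.0 or ci_low > 1.0


# (OR, CI lower bound, CI upper bound) as reported in the abstract.
reported = {
    "X-ray vs CT": (2.21, 1.33, 3.69),
    "pelvis vs abdomen": (0.16, 0.02, 0.56),
    "neoplastic vs bleeding": (0.37, 0.16, 0.84),
}

for label, (odds_ratio, ci_low, ci_high) in reported.items():
    if odds_ratio < 1:
        summary = f"{times_lower(odds_ratio):.2f} times lower odds"
    else:
        summary = f"{odds_ratio:.2f} times higher odds"
    print(f"{label}: {summary}, CI excludes 1: {excludes_one(ci_low, ci_high)}")
```

Because the quantities are odds, a value such as 2.21 does not mean that X-rays were read correctly 2.21 times as often as CT scans; translating it into a difference in probabilities would require the baseline accuracy, which this summary does not report.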

Conclusion

The bimodal distribution of median ratings points to an overreliance on similarity to previously encountered images and underscores the need for a systematic approach to image analysis. Future research should prioritize eliminating hallucinations, establishing standardized evaluation criteria, and exploring methods to integrate visual and text-based data in a balanced manner. Efforts should also be directed towards enhancing dataset diversity to improve the model's overall accuracy and generalizability.

## Full-text entities

- **Diseases:** hallucination (MESH:D006212), bleeding (MESH:D006470), neoplastic (MESH:D009369)
- **Chemicals:** GPT-4 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12341017/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12341017/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/PMC12341017/full.md

---
Source: https://tomesphere.com/paper/PMC12341017