# Visual Large Language Models in Radiology: A Systematic Multimodel Evaluation of Diagnostic Accuracy and Hallucinations

**Authors:** Marc Sebastian von der Stück, Roman Vuskov, Simon Westfechtel, Robert Siepmann, Christiane Kuhl, Daniel Truhn, Sven Nebelung

PMC · DOI: 10.3390/life16010066 · Life · 2026-01-01

## TL;DR

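This study evaluates how well seven visual large language models (VLLMs) interpret radiological images and finds them diagnostically unreliable: accuracy ranged from 8.1% to 29.2% across models, and hallucinations occurred in 74.4% of cases.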

## Contribution

The study systematically compares seven general-purpose and biomedical VLLMs on 180 radiology images (radiography, CT, and MRI), quantifying their diagnostic accuracy, hallucination rates, and dependence on clinical context.

## Key findings

- Gemini 2.0 outperformed other models with 29.2% diagnostic accuracy, while LLaVA had the lowest at 8.1%.
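- Clinical context raised diagnostic accuracy from 10.6% to 24.0% (13.4 percentage points; p < 0.001) but shifted the models toward relying on the textual information.
- Hallucinations (fabricated findings or misidentifications) occurred in 74.4% of cases overall and varied significantly between models (51.7–82.8%; p ≤ 0.004).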
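
The size of the context effect can be checked with simple arithmetic on the two accuracies reported in the abstract (10.6% without and 24.0% with clinical context). The sketch below is illustrative only: the variable names are hypothetical, and the odds ratio it prints is a crude, unadjusted figure derived from the pooled percentages, not the estimate from the authors' mixed-effects model.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


# Pooled diagnostic accuracies as reported in the abstract.
acc_without_context = 0.106
acc_with_context = 0.240

# Absolute change, in percentage points.
abs_gain_pp = (acc_with_context - acc_without_context) * 100

# Relative change: how many times higher accuracy is with context.
rel_gain = acc_with_context / acc_without_context

# Unadjusted odds ratio (assumption: ignores model, modality, and
# per-image effects that the paper's regression accounts for).
odds_ratio = odds(acc_with_context) / odds(acc_without_context)

print(f"absolute gain: {abs_gain_pp:.1f} percentage points")
print(f"relative accuracy: {rel_gain:.2f}x")
print(f"unadjusted odds ratio: {odds_ratio:.2f}")
```

Even after more than doubling, accuracy with context stays below one in four, which is why the absolute figures matter more here than the relative gain.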

## Abstract

Visual large language models (VLLMs) are discussed as potential tools for assisting radiologists in image interpretation, yet their clinical value remains unclear. This study provides a systematic and comprehensive comparison of general-purpose and biomedical VLLMs in radiology. We evaluated 180 representative clinical images with validated reference diagnoses (radiography, CT, MRI; 60 each) using seven VLLMs (ChatGPT-4o, Gemini 2.0, Claude Sonnet 3.7, Perplexity AI, Google Vision AI, LLaVA-1.6, LLaVA-Med-v1.5). Each model interpreted each image without and with clinical context. Mixed-effects logistic regression models assessed the influence of model, modality, and context on diagnostic performance and hallucinations (fabricated findings or misidentifications). Diagnostic accuracy varied significantly across all dimensions (p ≤ 0.001), ranging from 8.1% to 29.2% across models, with Gemini 2.0 performing best and LLaVA performing weakest. CT achieved the best overall accuracy (20.7%), followed by radiography (17.3%) and MRI (13.9%). Clinical context improved accuracy from 10.6% to 24.0% (p < 0.001) but shifted the models to rely more on textual information. Hallucinations were frequent (74.4% overall) and model-dependent (51.7–82.8% across models; p ≤ 0.004). Current VLLMs remain diagnostically unreliable, heavily context-biased, and prone to generating false findings, which limits their clinical suitability. Domain-specific training and rigorous validation are required before clinical integration can be considered.
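
To make the evaluation design concrete, the sketch below enumerates the grid implied by the abstract: seven models, 180 images (60 per modality), and two context conditions. It assumes a fully crossed design in which every model read every image under both conditions. The resulting total is derived here and is not a figure stated in the abstract, and the image identifiers are hypothetical placeholders.

```python
from itertools import product

# Models and modalities as named in the abstract.
MODELS = [
    "ChatGPT-4o", "Gemini 2.0", "Claude Sonnet 3.7", "Perplexity AI",
    "Google Vision AI", "LLaVA-1.6", "LLaVA-Med-v1.5",
]
MODALITIES = ["radiography", "CT", "MRI"]
IMAGES_PER_MODALITY = 60
CONTEXTS = ["without clinical context", "with clinical context"]

# Hypothetical image identifiers: (modality, running index).
images = [
    (modality, index)
    for modality in MODALITIES
    for index in range(1, IMAGES_PER_MODALITY + 1)
]

# Assumption: fully crossed design, one reading per
# (model, image, context) combination.
readings = list(product(MODELS, images, CONTEXTS))

print(f"images: {len(images)}")
print(f"readings per model: {len(images) * len(CONTEXTS)}")
print(f"total readings: {len(readings)}")
```

Because the same image recurs across models and conditions, readings are not independent, which is the usual motivation for a mixed-effects model rather than a plain logistic regression.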

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12842777/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12842777/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12842777/full.md

---
Source: https://tomesphere.com/paper/PMC12842777