# Opportunities and Challenges of Visual Large Language Models in Imaging Diagnostics: Lessons from Brain Metastasis Detection in Clinical MRI

**Authors:** Christian Nelles, Nour Abou Zeid, Robert Terzis, Andra-Iza Iuga, Lukas Görtz, Marvin A. Spurek, David Maintz, Simon Lennartz, Jonathan Kottlors

PMC · DOI: 10.3390/diagnostics16050749 · Diagnostics · 2026-03-03

## TL;DR

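This study tested two visual large language models, GPT-4o and Claude Sonnet 3.5, on brain metastasis detection in clinical MRI. Both flagged every examination that contained metastases (100% sensitivity) but also flagged nearly all metastasis-free examinations (specificity 8% and 4%), and both hallucinated additional lesions in 12% of cases.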

## Contribution

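The study evaluates the diagnostic accuracy of two visual large language models (vLLMs), GPT-4o and Claude Sonnet 3.5, for brain metastasis detection in clinical MRI, using combined imaging and textual input (clinical history and the referring question).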

## Key findings

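- Both models had 100% sensitivity but very low specificity (8% for GPT-4o, 4% for Claude Sonnet 3.5; p = 0.625), giving low overall accuracy (54% vs. 52%).
- GPT-4o identified MRI sequences significantly more accurately than Claude Sonnet 3.5 (100% vs. 93%; p < 0.05).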
- Both models hallucinated additional lesions in 12% of cases.

## Abstract

Background/Objectives: To evaluate the diagnostic accuracy of two visual large language models (vLLMs), GPT-4o (OpenAI) and Claude Sonnet 3.5 (Anthropic), for detecting brain metastases in routine MRI using combined imaging and textual input. Methods: This retrospective study included 31 patients with and 46 without brain metastases with underlying melanoma (n = 24), lung cancer (n = 23), breast cancer (n = 17), or renal cell carcinoma (n = 13). In total, 100 MRI examinations (50 with, 50 without metastases) were provided to both vLLMs using a single representative slice per sequence, together with clinical history and the referring question. The generated free-text reports were evaluated for detection accuracy, overdiagnosis, correct sequence recognition, anatomical localization, lesion laterality, and lesion size estimation. Results: Both vLLMs showed perfect sensitivity (100% for both) but very low specificity (GPT-4o: 8%, Sonnet 3.5: 4%; p = 0.625), resulting in low diagnostic accuracy (GPT-4o: 54%, Sonnet 3.5: 52%; p = 0.625). Sequence identification was highly accurate in both models, with GPT-4o performing significantly better (100% vs. 93%; p < 0.05). Identification of the anatomical brain region (70% vs. 72%; p = 1.00) and lesion laterality (62% vs. 76%; p = 0.189) was comparable. Both models hallucinated additional lesions in 12% of cases. Lesion size measurements showed no significant differences between the models or in comparison with the radiologist. Conclusions: GPT-4o and Claude Sonnet 3.5 can generate radiological reports and detect brain metastases with excellent sensitivity, but their very low specificity, frequent hallucinations, and limited spatial reliability currently preclude clinical application. Future work should address how the balance between visual and textual input influences diagnostic behavior in vLLMs.
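The reported accuracies follow arithmetically from the sensitivity, the specificity, and the 50/50 split of examinations. The sketch below is a minimal check of that arithmetic, assuming the percentages in the abstract correspond to whole-number examination counts; the function names are illustrative and do not come from the paper.

```python
def confusion_counts(n_pos, n_neg, sensitivity, specificity):
    """Recover confusion-matrix counts from rates (assumes whole-number counts)."""
    tp = round(sensitivity * n_pos)   # metastasis exams correctly flagged
    tn = round(specificity * n_neg)   # metastasis-free exams correctly cleared
    return {"tp": tp, "fn": n_pos - tp, "tn": tn, "fp": n_neg - tn}


def accuracy(counts):
    """Share of all examinations classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# 50 examinations with and 50 without metastases, as stated in the abstract.
gpt4o = confusion_counts(50, 50, sensitivity=1.00, specificity=0.08)
sonnet = confusion_counts(50, 50, sensitivity=1.00, specificity=0.04)

print(gpt4o, accuracy(gpt4o))    # accuracy 0.54
print(sonnet, accuracy(sonnet))  # accuracy 0.52
```

Under this assumption, the specificities imply 46 (GPT-4o) and 48 (Claude Sonnet 3.5) false positives among the 50 metastasis-free examinations, which is why accuracy stays close to 50% despite perfect sensitivity.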

## Linked entities

- **Diseases:** melanoma (MONDO:0005105), lung cancer (MONDO:0005138), breast cancer (MONDO:0004989), renal cell carcinoma (MONDO:0005086)

## Full-text entities

- **Diseases:** hallucinations (MESH:D006212), melanoma (MESH:D008545), lung cancer (MESH:D008175), Brain Metastasis (MESH:D009362), renal cell carcinoma (MESH:D002292), breast cancer (MESH:D001943), brain metastases (MESH:D001932)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12984547/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12984547/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12984547/full.md

---
Source: https://tomesphere.com/paper/PMC12984547