Medical Context Distorts Decisions in Clinical Vision Language Models

David Restrepo; Ira Ktena; Maria Vakalopoulou; Stergios Christodoulidis; Enzo Ferrante

arXiv:2605.17436·cs.CV·May 19, 2026

TL;DR

Vision-language models proposed for clinical decision support let text dominate their decisions even when visual evidence is available, are swayed by irrelevant clinical reports, and can reverse correct image-based predictions after minor prompt changes, which raises concerns about their reliability.

Contribution

The paper systematically evaluates three failure modes of vision-language models in clinical settings: over-reliance on text over images, spurious reliance on irrelevant clinical history, and sensitivity to semantically equivalent prompt variations.

Findings

01. VLM decisions are dominated by text modality even with visual evidence available.

02. Models are heavily influenced by irrelevant clinical reports.

03. Minor prompt changes can reverse correct image-based predictions.

Abstract

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically tuned open and closed VLMs on chest X-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we find that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observe that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct…
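
The failure modes named in the abstract can be quantified in several ways. The sketch below is one hypothetical scoring scheme and is not taken from the paper. The function names `flip_rate` and `text_follow_rate`, the integer label encoding, and the toy inputs are all assumptions made for illustration. `flip_rate` measures how often a prediction that was correct at baseline changes after a perturbation such as a paraphrased prompt or an added irrelevant report. `text_follow_rate` measures how often the prediction sides with the text when image and text disagree.

```python
from typing import Sequence


def flip_rate(labels: Sequence[int],
              baseline: Sequence[int],
              perturbed: Sequence[int]) -> float:
    """Fraction of baseline-correct predictions that change after a perturbation.

    Hypothetical metric for illustration; not the paper's definition.
    """
    if not (len(labels) == len(baseline) == len(perturbed)):
        raise ValueError("all inputs must have the same length")
    # Indices where the unperturbed prediction matched the ground truth.
    correct = [i for i, (y, b) in enumerate(zip(labels, baseline)) if y == b]
    if not correct:
        return 0.0
    flipped = sum(1 for i in correct if perturbed[i] != baseline[i])
    return flipped / len(correct)


def text_follow_rate(image_labels: Sequence[int],
                     text_labels: Sequence[int],
                     preds: Sequence[int]) -> float:
    """Among cases where image and text disagree, fraction of predictions
    that agree with the label implied by the text.

    Hypothetical metric for illustration; not the paper's definition.
    """
    if not (len(image_labels) == len(text_labels) == len(preds)):
        raise ValueError("all inputs must have the same length")
    # Indices where the image-derived and text-implied labels conflict.
    conflicts = [i for i, (im, tx) in enumerate(zip(image_labels, text_labels))
                 if im != tx]
    if not conflicts:
        return 0.0
    followed = sum(1 for i in conflicts if preds[i] == text_labels[i])
    return followed / len(conflicts)


# Toy data (made up): 1 = finding present, 0 = finding absent.
labels = [1, 0, 1, 1, 0]
baseline = [1, 0, 1, 0, 0]    # predictions with the original prompt
perturbed = [0, 0, 1, 0, 1]   # predictions after a paraphrased prompt
print(flip_rate(labels, baseline, perturbed))
```

On the toy data, four of the five baseline predictions are correct and two of those four change under the perturbation, so the printed value is 0.5. A `text_follow_rate` near 1.0 on deliberately mismatched image-report pairs would be consistent with the text dominance the abstract describes.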
