6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models

Leon Mayer; Piotr Kalinowski; Caroline Ebersbach; Marcel Knopp; Tim Rädsch; Evangelia Christodoulou; Annika Reinke; Fiona R. Kolbinger; Lena Maier-Hein

arXiv:2512.04238·cs.CV·December 5, 2025

TL;DR

This paper introduces AdversarialAnatomyBench, a benchmark revealing that current vision-language models perform poorly on rare anatomical variants, exposing a critical weakness in their generalization capabilities for medical imaging.

Contribution

The paper presents the first benchmark for natural adversarial anatomical variants, highlighting significant performance drops of state-of-the-art models on rare medical cases.

Findings

01. Mean accuracy drops from 74% on typical anatomy to 29% on atypical anatomy.

02. Even the best-performing models (GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick) show performance drops of 41-51%.

03. Neither scaling nor bias-aware prompting mitigates these failures.
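
The summary here does not say whether the reported drops are absolute (percentage points) or relative (percent of the typical-anatomy accuracy). As a minimal sketch, the helper below computes both for the reported mean accuracies of 74% and 29%. The function name `accuracy_drop` is hypothetical and not taken from the paper.

```python
def accuracy_drop(typical: float, atypical: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Both inputs are accuracies expressed in percent (0-100).
    """
    absolute = typical - atypical            # difference in percentage points
    relative = 100.0 * absolute / typical    # share of the typical accuracy that was lost
    return absolute, relative


# Mean accuracies reported in the findings: 74% typical, 29% atypical.
absolute, relative = accuracy_drop(74.0, 29.0)
print(f"absolute drop: {absolute:.0f} percentage points")
print(f"relative drop: {relative:.1f}%")
```

Under this reading, the mean result is a 45-point absolute drop, or about 60.8% of the typical-anatomy accuracy. Which convention the 41-51% range for the top models follows should be checked against the full paper.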

Abstract

Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about "typical" human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning