6 Fingers, 1 Kidney: Natural Adversarial Medical Images Reveal Critical Weaknesses of Vision-Language Models
Leon Mayer, Piotr Kalinowski, Caroline Ebersbach, Marcel Knopp, Tim Rädsch, Evangelia Christodoulou, Annika Reinke, Fiona R. Kolbinger, Lena Maier-Hein

TL;DR
This paper introduces AdversarialAnatomyBench, a benchmark revealing that current vision-language models perform poorly on rare anatomical variants, exposing a critical weakness in their generalization capabilities for medical imaging.
Contribution
The paper presents the first benchmark for natural adversarial anatomical variants, highlighting significant performance drops of state-of-the-art models on rare medical cases.
Findings
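Mean accuracy across 22 VLMs drops from 74% on typical anatomy to 29% on atypical anatomy.
Even the best-performing models (GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick) show performance drops of 41-51%.
Neither model scaling nor bias-aware prompting mitigates the problem.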
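The 74% and 29% mean accuracies can be read as either an absolute or a relative drop, and the two differ considerably. The sketch below computes both for those two figures. The helper name `accuracy_drop` is hypothetical and not from the paper, and the source does not state which convention the reported 41-51% drops use.

```python
def accuracy_drop(typical: float, atypical: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; not part of the benchmark's code.
    """
    if typical <= 0:
        raise ValueError("typical accuracy must be positive")
    absolute = typical - atypical          # difference in percentage points
    relative = 100.0 * absolute / typical  # share of typical accuracy lost
    return absolute, relative


# Mean accuracies reported in the findings: 74% typical, 29% atypical.
absolute, relative = accuracy_drop(74.0, 29.0)
print(f"absolute drop: {absolute:.0f} percentage points")
print(f"relative drop: {relative:.1f}% of typical-anatomy accuracy")
```

For the mean figures this gives a 45 percentage point absolute drop, which corresponds to losing about 60.8% of the typical-anatomy accuracy.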
Abstract
Vision-language models are increasingly integrated into clinical workflows. However, existing benchmarks primarily assess performance on common anatomical presentations and fail to capture the challenges posed by rare variants. To address this gap, we introduce AdversarialAnatomyBench, the first benchmark comprising naturally occurring rare anatomical variants across diverse imaging modalities and anatomical regions. We call such variants that violate learned priors about "typical" human anatomy natural adversarial anatomy. Benchmarking 22 state-of-the-art VLMs with AdversarialAnatomyBench yielded three key insights. First, when queried with basic medical perception tasks, mean accuracy dropped from 74% on typical to 29% on atypical anatomy. Even the best-performing models, GPT-5, Gemini 2.5 Pro, and Llama 4 Maverick, showed performance drops of 41-51%. Second, model errors closely…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
