VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

Rohit Saxena; Alessandro Suglia; Pasquale Minervini

arXiv:2603.06148 · cs.CV · March 9, 2026

TL;DR

This paper introduces VLM-RobustBench, a benchmark that measures how robust vision-language models are to real-world image distortions. It shows that the models are especially fragile under spatial perturbations, even at low severity, and uses these results to point toward robustness improvements.

Contribution

The paper presents VLM-RobustBench, a new benchmark with 133 corrupted settings built from 49 augmentation types. It evaluates VLMs from four families (Qwen, InternVL, Molmo, Gemma) on MMBench and MMMU-Pro, highlights their weaknesses, and proposes directions for enhancing robustness.

Findings

01. Visual severity is a weak predictor of how difficult a distortion is for a model.

02. Low-severity spatial distortions often degrade performance more than visually severe photometric corruptions.

03. Geometric distortions cause the largest performance drops.
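
The findings above compare distortions by how much accuracy they cost. A minimal sketch of that comparison, assuming the drop is measured as clean accuracy minus corrupted accuracy in percentage points (pp) and averaged over models, might look like the following. The function names, model names, setting names, and every number here are hypothetical placeholders, not the paper's code or results.

```python
from statistics import mean


def accuracy_drop_pp(clean_acc: float, corrupted_acc: float) -> float:
    """Accuracy drop in percentage points; inputs are fractions in [0, 1]."""
    return (clean_acc - corrupted_acc) * 100.0


def mean_drop_by_setting(clean, corrupted):
    """Average the per-model drop for each corrupted setting.

    clean:     {model_name: clean accuracy}
    corrupted: {setting: {model_name: accuracy under that setting}}
    """
    return {
        setting: mean(
            accuracy_drop_pp(clean[model], acc) for model, acc in per_model.items()
        )
        for setting, per_model in corrupted.items()
    }


# Hypothetical toy numbers for illustration only, NOT results from the paper.
clean = {"model_a": 0.80, "model_b": 0.70}
corrupted = {
    ("spatial_x", "low"): {"model_a": 0.70, "model_b": 0.62},
    ("photometric_y", "high"): {"model_a": 0.78, "model_b": 0.67},
}

drops = mean_drop_by_setting(clean, corrupted)
# Settings ordered from most to least damaging.
ranked = sorted(drops, key=drops.get, reverse=True)
```

Ranking settings by mean drop rather than by nominal severity is what lets a low-severity setting appear above a high-severity one, which is the pattern the first two findings describe.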

Abstract

Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest…
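
The abstract gives 49 augmentation types and 133 corrupted settings but not how the types split between graded and binary. As a rough consistency check, the sketch below assumes every graded type is evaluated at exactly three severities (low/mid/high) and every binary transform contributes exactly one setting; that counting scheme is an assumption, not something the abstract states. Under it, the two totals imply 42 graded types and 7 binary transforms. The names `split_types`, `graded_00`, `binary_00`, and so on are hypothetical placeholders, not the benchmark's identifiers.

```python
from itertools import product

N_TYPES = 49        # augmentation types, from the abstract
N_SETTINGS = 133    # corrupted settings, from the abstract
SEVERITIES = ("low", "mid", "high")


def split_types(n_types: int, n_settings: int, n_levels: int):
    """Solve g + b = n_types and n_levels * g + b = n_settings for (g, b).

    Assumes each graded type yields n_levels settings and each binary
    transform yields one setting.
    """
    extra = n_settings - n_types          # equals (n_levels - 1) * g
    graded, remainder = divmod(extra, n_levels - 1)
    if remainder or not 0 <= graded <= n_types:
        raise ValueError("counts are inconsistent with the assumed scheme")
    return graded, n_types - graded


graded, binary = split_types(N_TYPES, N_SETTINGS, len(SEVERITIES))

# Placeholder names only; the real augmentation names are not listed here.
graded_names = [f"graded_{i:02d}" for i in range(graded)]
binary_names = [f"binary_{i:02d}" for i in range(binary)]

# One (name, severity) pair per graded setting, (name, None) per binary one.
settings = list(product(graded_names, SEVERITIES)) + [(n, None) for n in binary_names]

print(graded, binary, len(settings))  # prints: 42 7 133
```

If the benchmark used a different scheme (for example, some types with fewer severity levels), the split would differ, so the 42/7 figure should be read as a derived check rather than a reported number.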

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)