VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
Rohit Saxena, Alessandro Suglia, Pasquale Minervini

TL;DR
This paper introduces VLM-RobustBench, a benchmark that measures how robust vision-language models are to real-world image distortions. It shows that the models are fragile under spatial perturbations and points to directions for improving robustness.
Contribution
The paper presents VLM-RobustBench, a new benchmark with 133 corrupted settings, and uses it to evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma), highlighting their weaknesses and proposing directions for enhancing robustness.
Findings
Visual severity is a weak predictor of how difficult a distortion is for a model.
Low-severity spatial distortions often degrade performance more than visually severe photometric corruptions.
Geometric distortions cause the largest performance drops.
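The findings above compare distortions by how much accuracy they cost, reported in percentage points (pp) and averaged across models. A minimal sketch of that calculation follows. The function names, model names, and accuracy values are all hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from statistics import mean


def accuracy_drop_pp(clean_acc: float, corrupted_acc: float) -> float:
    """Accuracy drop in percentage points; inputs are fractions in [0, 1]."""
    return (clean_acc - corrupted_acc) * 100.0


def mean_drop_pp(clean: dict[str, float], corrupted: dict[str, float]) -> float:
    """Average the per-model drop over every model present in `clean`."""
    return mean(accuracy_drop_pp(clean[m], corrupted[m]) for m in clean)


# Placeholder accuracies (NOT from the paper), keyed by hypothetical model name.
clean = {"model_a": 0.80, "model_b": 0.70}
low_severity_spatial = {"model_a": 0.74, "model_b": 0.66}
high_severity_photometric = {"model_a": 0.78, "model_b": 0.69}

spatial_drop = mean_drop_pp(clean, low_severity_spatial)
photometric_drop = mean_drop_pp(clean, high_severity_photometric)
print(f"spatial: {spatial_drop:.1f} pp, photometric: {photometric_drop:.1f} pp")
```

With these placeholder values the low-severity spatial setting costs more accuracy than the high-severity photometric one, which is the pattern the findings describe.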
Abstract
Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest…
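The abstract gives 49 augmentation types and 133 corrupted settings but does not state how the types split between graded and binary. The sketch below works out one split that is consistent with those totals. It assumes that every type is either graded with exactly three severity levels (low/mid/high) or binary with a single setting; that assumption and the helper name `split_types` are illustrative and are not taken from the paper.

```python
def split_types(total_types: int, total_settings: int, levels: int = 3) -> tuple[int, int]:
    """Solve for (graded, binary) counts under the stated assumption.

    graded + binary          = total_types
    levels * graded + binary = total_settings
    """
    extra = total_settings - total_types  # equals (levels - 1) * graded
    if extra < 0 or extra % (levels - 1) != 0:
        raise ValueError("totals are inconsistent with the assumed split")
    graded = extra // (levels - 1)
    binary = total_types - graded
    if binary < 0:
        raise ValueError("totals are inconsistent with the assumed split")
    return graded, binary


graded, binary = split_types(total_types=49, total_settings=133)
print(graded, binary)  # 42 7
```

Under this assumption the totals correspond to 42 graded types and 7 binary transforms, since 42 × 3 + 7 = 133.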
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
