Questioning the Stability of Visual Question Answering
Amir Rosenfeld, Neta Glazer, Ethan Fetaya

TL;DR
This paper systematically investigates the robustness of Visual Language Models to minor, meaning-preserving visual and textual perturbations, revealing significant instability even in state-of-the-art systems and proposing stability as a predictor of correctness.
Contribution
It provides the first large-scale analysis of VLM robustness to benign perturbations and introduces stability as a predictor of model correctness, highlighting fundamental fragilities.
Findings
Modern VLMs are highly sensitive to small, meaning-preserving perturbations.
A sample's stability under perturbation correlates strongly with its correctness.
Stability measured on open-source models can predict the accuracy of larger models.
Abstract
Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts…
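To make the notion of per-sample stability concrete, here is a minimal sketch of one way it could be scored from the answers a model gives to the original image-question pair and its perturbed variants. The abstract shown here does not give the paper's exact definition, so the agreement-with-majority score, the answer normalization, and the names `normalize`, `sample_stability`, and `is_stable` are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase, trim, drop a trailing period,
    and collapse repeated whitespace so trivially different strings match."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def sample_stability(answers: Sequence[str]) -> float:
    """Assumed stability score: the fraction of answers (original plus
    perturbed variants) that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)


def is_stable(answers: Sequence[str]) -> bool:
    """A sample is stable if no perturbation changed the predicted answer."""
    return len({normalize(a) for a in answers}) == 1


# Answers to one question under the original input and three perturbations.
answers = ["Yes", "yes.", "no", "YES "]
print(sample_stability(answers))  # 0.75
print(is_stable(answers))         # False
```

Under this sketch, a score of 1.0 means every perturbed variant produced the same answer as the original, while lower scores flag samples whose prediction flips under at least one modification.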
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
