Questioning the Stability of Visual Question Answering

Amir Rosenfeld; Neta Glazer; Ethan Fetaya

arXiv:2511.11206·cs.CV·November 17, 2025

TL;DR

This paper systematically studies how robust Visual Language Models (VLMs) are to minor, meaning-preserving visual and textual perturbations. It finds significant instability even in state-of-the-art systems and proposes stability as a predictor of correctness.

Contribution

The paper provides the first large-scale analysis of VLM robustness to benign perturbations and introduces stability as a predictor of model correctness, exposing fundamental fragilities in current models.

Findings

01

Modern VLMs are highly sensitive to small perturbations.

02

Sample stability correlates strongly with correctness.

03

Stability measured on open-source models can predict the accuracy of larger models.
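One way to check the second finding on your own evaluation data is to split samples by whether their answer stayed the same under perturbation and compare accuracy across the two groups. The sketch below is illustrative only: the `Sample` record, the `accuracy_by_stability` helper, and the toy data are assumptions made for this example, not the authors' code or their exact definition of stability.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record for one image-question pair.
    stable: bool   # True if the answer did not change under any perturbation
    correct: bool  # True if the unperturbed answer matched the ground truth


def accuracy_by_stability(samples):
    """Return (accuracy on stable samples, accuracy on unstable samples)."""
    def accuracy(group):
        # An empty group has no defined accuracy, so report NaN.
        if not group:
            return float("nan")
        return sum(s.correct for s in group) / len(group)

    stable = [s for s in samples if s.stable]
    unstable = [s for s in samples if not s.stable]
    return accuracy(stable), accuracy(unstable)


# Toy data invented for illustration; these are not results from the paper.
toy = [
    Sample(stable=True, correct=True),
    Sample(stable=True, correct=True),
    Sample(stable=True, correct=False),
    Sample(stable=False, correct=False),
    Sample(stable=False, correct=True),
    Sample(stable=False, correct=False),
]
stable_acc, unstable_acc = accuracy_by_stability(toy)
```

If stability is informative about correctness, `stable_acc` should come out noticeably higher than `unstable_acc` on real data. A binary stable/unstable flag is the simplest choice; a graded score (for example, the share of perturbations that leave the answer unchanged) would allow a finer comparison.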

Abstract

Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts…
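The measurement described in the abstract can be sketched in a few lines: apply a benign perturbation, collect the model's answers, and count the fraction of samples whose answer changes under at least one perturbation. The code below is a minimal sketch under stated assumptions: images are plain 2D lists of pixel values, answers are compared after simple lowercasing and whitespace normalization, and the names `shift_image`, `normalize`, `is_unstable`, and `instability_rate` are hypothetical. The paper's actual perturbation parameters and answer-matching rules are not given in this excerpt.

```python
def shift_image(image, dx, dy, fill=0):
    """Translate a 2D grid of pixel values by (dx, dy) pixels.

    Pixels shifted outside the frame are dropped; exposed pixels get `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def normalize(answer):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(answer.lower().split())


def is_unstable(original_answer, perturbed_answers):
    """True if any perturbed answer differs from the original answer."""
    base = normalize(original_answer)
    return any(normalize(a) != base for a in perturbed_answers)


def instability_rate(results):
    """Fraction of samples whose answer changes under at least one perturbation.

    `results` is a list of (original_answer, [perturbed_answers]) pairs.
    """
    if not results:
        return 0.0
    flipped = sum(is_unstable(orig, perturbed) for orig, perturbed in results)
    return flipped / len(results)
```

In use, each perturbed image (or paraphrased question) would be sent to the model under test and the returned answers passed to `instability_rate`. The model call is deliberately left out, since it depends on the API of whichever VLM is being evaluated.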

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning