Analyzing the Sensitivity of Vision Language Models in Visual Question Answering

Monika Shah; Sudarshan Balaji; Somdeb Sarkhel; Sanorita Dey; Deepak Venugopal

arXiv:2507.21335·cs.CV·July 30, 2025

TL;DR

This paper investigates how Vision Language Models (VLMs) handle violations of conversational principles in visual question answering. Adding modifiers to the questions lowers model performance, which points to a limitation of current VLMs.

Contribution

The study assesses VLMs' sensitivity to violations of Grice's conversational maxims by adding modifiers to human-crafted questions and analyzing how the models' responses change, with human understanding as the point of comparison.

Findings

01. VLM performance decreases with question modifiers

02. VLMs struggle with violations of conversational principles

03. Study highlights limitations of current VLMs in handling nuanced questions

Abstract

We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of the cooperative principles of conversation proposed by Grice. Even when Grice's maxims of conversation are flouted, humans typically have little difficulty understanding the conversation, although it requires more cognitive effort. We study whether VLMs can handle violations of Grice's maxims in a manner similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the responses of VLMs to these modifiers. In our study, we use three state-of-the-art VLMs, namely GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash, on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently…
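The procedure described in the abstract (add a modifier to a question, then compare the model's answers with and without it) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the modifier phrases, the `model` callable, and the helper names are hypothetical, and the scoring uses the commonly cited VQA soft accuracy, min(matching human answers / 3, 1), which this excerpt does not confirm as the paper's metric.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical modifiers for illustration only; the excerpt does not list
# the modifiers actually used in the paper.
MODIFIERS: Dict[str, str] = {
    "hedge": "I am not sure, but ",
    "verbose": "If you look carefully at the whole picture, ",
}

# One sample: (image reference, question, human answers).
Sample = Tuple[str, str, List[str]]
# Stand-in for a VLM call: (image reference, question) -> answer text.
Model = Callable[[str, str], str]


def add_modifier(question: str, modifier: str) -> str:
    """Prepend a modifier and lowercase the question's first letter."""
    if not question:
        return modifier
    return modifier + question[0].lower() + question[1:]


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Soft accuracy: min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


def mean_accuracy(model: Model, samples: List[Sample], modifier: str = "") -> float:
    """Average soft accuracy, optionally with a modifier added to each question."""
    scores = []
    for image_ref, question, answers in samples:
        asked = add_modifier(question, modifier) if modifier else question
        scores.append(vqa_accuracy(model(image_ref, asked), answers))
    return sum(scores) / len(scores) if scores else 0.0


def accuracy_drop(model: Model, samples: List[Sample], modifier: str) -> float:
    """Accuracy on the original questions minus accuracy on the modified ones."""
    return mean_accuracy(model, samples) - mean_accuracy(model, samples, modifier)
```

A positive `accuracy_drop` means the model scored lower on the modified questions than on the originals. In practice `model` would wrap an API call to a VLM and load the image from `image_ref`; both are left abstract here.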
