Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Mohammed Safi Ur Rahman Khan; Sanjay Suryanarayanan; Tushar Anand; Mitesh M. Khapra

arXiv:2604.21523 · cs.CV · April 24, 2026

TL;DR

This paper systematically evaluates the reliability of vision-language models used as evaluators, revealing significant blind spots and limitations in detecting various output errors across image-to-text and text-to-image tasks.

Contribution

It introduces a comprehensive benchmark with targeted perturbations to assess Evaluator VLMs and uncovers their substantial blind spots and limitations.

Findings

01

Evaluator VLMs often fail to detect perturbed outputs, with failure rates exceeding 50%

02

They struggle with fine-grained compositional and spatial errors

03

Pairwise comparison methods are more reliable but still imperfect

Abstract

Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both for image-to-text (I2T) tasks such as visual question answering and for text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4,000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and…
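
The perturbation test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Instance` type, the `score` and `prefer` callables, and the toy data are all hypothetical, and the image input is left out for brevity. The sketch assumes a perturbation counts as detected when the evaluator rates the perturbed output strictly lower than the original (single-answer scoring) or prefers the original (pairwise comparison).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Instance:
    """One benchmark item (hypothetical schema; the image input is omitted)."""
    prompt: str
    original: str   # unperturbed output
    perturbed: str  # same output with one injected error


# Hypothetical evaluator interfaces, standing in for calls to an Evaluator VLM.
ScoreFn = Callable[[str, str], float]      # (prompt, output) -> quality score
PreferFn = Callable[[str, str, str], str]  # (prompt, output_a, output_b) -> "A", "B", or "tie"


def single_answer_detection_rate(items: Sequence[Instance], score: ScoreFn) -> float:
    """Fraction of items where the perturbed output scores strictly lower."""
    if not items:
        raise ValueError("items must not be empty")
    detected = sum(
        score(it.prompt, it.perturbed) < score(it.prompt, it.original)
        for it in items
    )
    return detected / len(items)


def pairwise_detection_rate(items: Sequence[Instance], prefer: PreferFn) -> float:
    """Fraction of items where the evaluator prefers the original (shown as A)."""
    if not items:
        raise ValueError("items must not be empty")
    detected = sum(
        prefer(it.prompt, it.original, it.perturbed) == "A" for it in items
    )
    return detected / len(items)


# Toy data, invented for illustration only.
toy_items = [
    Instance("What color is the car?", "The car is red.", "The car is blue."),
    Instance("Where is the cat?", "The cat is on the left.", "The cat is on the right."),
]


def length_score(prompt: str, output: str) -> float:
    """A deliberately blind evaluator: it only counts words."""
    return float(len(output.split()))


blind_rate = single_answer_detection_rate(toy_items, length_score)
```

Because each toy perturbation swaps one word without changing the length, the word-counting evaluator scores both versions equally and detects none of the errors. That is the kind of blind spot such a benchmark is meant to expose. A real setup would also need to control for answer order in the pairwise case, which this sketch ignores.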

Code & Models

Datasets

ai4bharat/Focus
dataset · 1.8k dl
