Vision-Language Models Can't See the Obvious

Yasser Dahou; Ngoc Dung Huynh; Phuc H. Le-Khac; Wamiq Reyaz Para; Ankit Singh; Sanath Narayan

arXiv:2507.04741·cs.CV·July 8, 2025

TL;DR

This paper introduces Saliency Benchmark (SalBench), a new test suite for evaluating vision-language models' ability to detect obvious visual features and anomalies that humans easily perceive.

Contribution

SalBench provides a novel, focused benchmark with three tasks to assess LVLMs' perceptual abilities on low-level visual features and anomalies.

Findings

01. LVLMs perform poorly on obvious visual anomalies

02. GPT-4o achieves only 47.6% accuracy on simple tasks

03. SalBench highlights limitations in current LVLM perceptual capabilities

Abstract

We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLMs) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes and that naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLMs: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLMs using SalBench, and our findings reveal a surprising limitation: LVLMs struggle to identify seemingly obvious visual anomalies, with…
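To make the "large circle amidst a grid of smaller ones" example concrete, the following is a minimal sketch of how such an odd-one-out stimulus and an exact-match accuracy score could be represented. It is not the authors' data-generation or evaluation code: the names `Stimulus`, `make_size_oddball`, and `accuracy`, the grid size, and the radii are all assumptions chosen for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Stimulus:
    """Hypothetical container for one size-based odd-one-out grid."""
    radii: list[list[int]]     # circle radius in each grid cell
    odd_cell: tuple[int, int]  # (row, col) of the single odd item


def make_size_oddball(rows: int = 4, cols: int = 4,
                      base_radius: int = 5, odd_radius: int = 12,
                      seed: int = 0) -> Stimulus:
    """Build a grid of identical circles with exactly one larger circle."""
    rng = random.Random(seed)
    odd = (rng.randrange(rows), rng.randrange(cols))
    radii = [
        [odd_radius if (r, c) == odd else base_radius for c in range(cols)]
        for r in range(rows)
    ]
    return Stimulus(radii=radii, odd_cell=odd)


def accuracy(predictions: list, targets: list) -> float:
    """Fraction of predictions that exactly match their targets."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

In this sketch a model's answer would be compared against `odd_cell` for each stimulus; the benchmark's actual answer format and scoring may differ.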
