VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Brigitta Malagurski Törtei; Yasser Dahou; Ngoc Dung Huynh; Wamiq Reyaz Para; Phúc H. Lê Khac; Ankit Singh; Sofian Chaybouti; Sanath Narayan

arXiv:2512.21194 · cs.CV · December 25, 2025

TL;DR

VisRes Bench is a benchmark that evaluates the visual reasoning of vision-language models (VLMs) across three levels of complexity and reveals clear limitations in their perceptual and relational reasoning.

Contribution

This paper introduces VisRes Bench, a benchmark designed to assess visual reasoning in VLMs in naturalistic settings without contextual language supervision, and uses it to document their current shortcomings.

Findings

01. VLMs perform near random on perceptual tasks with subtle perturbations.

02. Models show limited ability in relational and compositional reasoning.

03. The benchmark isolates distinct reasoning abilities across three levels.
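
To make "near random" concrete, the sketch below checks whether an observed accuracy is statistically indistinguishable from chance on a multiple-choice task. It is an illustration only, not the paper's evaluation protocol: the number of answer options, the sample counts, and the helper names (`chance_accuracy`, `wilson_interval`, `is_near_chance`) are assumptions introduced here.

```python
import math


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 2:
        raise ValueError("a multiple-choice task needs at least two options")
    return 1.0 / num_options


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def is_near_chance(correct: int, total: int, num_options: int, z: float = 1.96) -> bool:
    """True when the chance rate lies inside the interval around the observed accuracy."""
    low, high = wilson_interval(correct, total, z)
    return low <= chance_accuracy(num_options) <= high


if __name__ == "__main__":
    # Hypothetical numbers: 27 of 100 correct on an assumed 4-option task.
    print(is_near_chance(27, 100, 4))
    # Hypothetical numbers: 60 of 100 correct on the same assumed task.
    print(is_near_chance(60, 100, 4))
```

A check of this kind separates a model that is genuinely guessing from one that is merely weak, which matters when the number of test items per task is small.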

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional…
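
The Level 1 perturbations named in the abstract can be illustrated with a minimal sketch on a grayscale image stored as a list of rows. This is not the benchmark's actual generation pipeline: the function names, the 3x3 box blur, the rectangular occluder, and the fixed 90-degree rotation are all assumptions chosen for brevity, and texture changes are omitted.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def occlude(img: Image, top: int, left: int, height: int, width: int,
            fill: float = 0.0) -> Image:
    """Return a copy with a rectangle overwritten by a constant fill value."""
    out = [list(row) for row in img]
    for r in range(max(top, 0), min(top + height, len(out))):
        for c in range(max(left, 0), min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def box_blur(img: Image) -> Image:
    """3x3 mean filter; border pixels average only their in-bounds neighbours."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                img[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            out[r][c] = sum(window) / len(window)
    return out


if __name__ == "__main__":
    image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(rotate90([[1, 2], [3, 4]]))
    print(occlude(image, 1, 1, 1, 1))
    print(box_blur(image))
```

Each function returns a new image rather than modifying its input, so a clean reference and its perturbed variants can be compared side by side.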

Code & Models

Datasets

tiiuae/visres_bench
dataset · 330 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling