VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

TL;DR
VisRes Bench is a benchmark that evaluates the visual reasoning of vision-language models (VLMs) at three levels of complexity, revealing clear limitations in their perceptual and relational reasoning.
Contribution
The paper introduces VisRes Bench, a benchmark that assesses visual reasoning in VLMs in naturalistic settings without contextual language supervision, and documents the models' current shortcomings.
Findings
VLMs perform near random on perceptual tasks with subtle perturbations.
Models show limited ability in relational and compositional reasoning.
The benchmark isolates distinct reasoning abilities across three levels.
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional…
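To make the Level 1 perturbations concrete, here is a minimal sketch of three of the perturbation types the abstract names (rotation, occlusion, blur), written in pure Python on a grayscale image stored as nested lists. The function names, the 90-degree rotation angle, the square occlusion patch, and the 3x3 box blur are illustrative assumptions; the listing does not state how VisRes Bench implements or parameterizes its perturbations, and texture changes are omitted here.

```python
from typing import List

# Assumption: a grayscale image represented as rows of pixel intensities.
Image = List[List[float]]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise (illustrative angle)."""
    return [list(row) for row in zip(*img[::-1])]


def occlude(img: Image, top: int, left: int, size: int, fill: float = 0.0) -> Image:
    """Return a copy with a size x size square patch replaced by `fill`."""
    out = [row[:] for row in img]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[r]))):
            out[r][c] = fill
    return out


def box_blur(img: Image) -> Image:
    """Apply a 3x3 mean filter; border pixels average over existing neighbours."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            out[r][c] = sum(vals) / len(vals)
    return out


if __name__ == "__main__":
    sample = [[1.0, 2.0], [3.0, 4.0]]
    print(rotate90(sample))
    print(occlude(sample, 0, 0, 1))
    print(box_blur(sample))
```

In a matching task of this kind, a model would be shown a perturbed image alongside candidates and asked to pick the one that corresponds to the original; the functions above only illustrate the perturbation step.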
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling
