iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, Elia Bruni

TL;DR
iVISPAR is a new interactive benchmark that evaluates the spatial reasoning abilities of vision-language models using a variant of the sliding tile puzzle across multiple modalities, revealing current models' limitations in complex spatial tasks.
Contribution
The paper introduces iVISPAR, a comprehensive multimodal benchmark for assessing spatial reasoning in VLMs, and provides an extensive evaluation highlighting their performance gaps.
Findings
VLMs perform better on 2D tasks than 3D or text-based tasks.
Current VLMs struggle with complex spatial configurations.
Humans outperform VLMs significantly in spatial reasoning tasks.
Abstract
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 3D, 2D, and text-based input modalities, enabling comprehensive assessments of VLMs' planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task's complexity and feasibility for humans. Results indicate that while VLMs perform better on 2D tasks compared to 3D or…
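The abstract mentions optimal path solutions for a sliding tile puzzle. As a rough illustration of what an optimal path means here, the sketch below runs breadth-first search on the classic sliding tile puzzle, not on iVISPAR's own variant, whose rules are not given in this summary. The board encoding (a flat tuple with 0 as the blank) and the function names `neighbors` and `shortest_solution` are assumptions made for this example and are not taken from the paper.

```python
from collections import deque

# Assumed encoding: a board is a flat tuple read row by row, 0 marks the blank.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbors(state, width):
    """Yield (move, next_state) for every legal move of the blank."""
    height = len(state) // width
    blank = state.index(0)
    row, col = divmod(blank, width)
    for move, (dr, dc) in MOVES.items():
        r, c = row + dr, col + dc
        if 0 <= r < height and 0 <= c < width:
            target = r * width + c
            board = list(state)
            board[blank], board[target] = board[target], board[blank]
            yield move, tuple(board)


def shortest_solution(start, goal, width):
    """Breadth-first search: return a shortest list of blank moves, or None."""
    if start == goal:
        return []
    parents = {start: None}  # state -> (previous state, move that led here)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for move, nxt in neighbors(state, width):
            if nxt in parents:
                continue
            parents[nxt] = (state, move)
            if nxt == goal:
                # Walk back to the start to recover the move sequence.
                path = []
                while parents[nxt] is not None:
                    nxt, move = parents[nxt]
                    path.append(move)
                return path[::-1]
            queue.append(nxt)
    return None  # goal is not reachable from start


if __name__ == "__main__":
    start = (1, 2, 3, 4, 5, 6, 0, 7, 8)
    goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
    print(shortest_solution(start, goal, width=3))
```

Breadth-first search guarantees a minimum number of moves because it expands states in order of distance from the start. A model's move count on a puzzle could then be compared against the length of the returned path. On larger boards a heuristic search such as A* would usually be preferred, since the state space grows quickly.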
Taxonomy
Topics: Constraint Satisfaction and Optimization
