iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs

Julius Mayer; Mohamad Ballout; Serwan Jassim; Farbod Nosrat Nezami; Elia Bruni

arXiv:2502.03214 · cs.CL · October 1, 2025

TL;DR

iVISPAR is a new interactive benchmark that evaluates the spatial reasoning abilities of vision-language models using a variant of the sliding tile puzzle across multiple modalities, revealing current models' limitations in complex spatial tasks.

Contribution

The paper introduces iVISPAR, a comprehensive multimodal benchmark for assessing spatial reasoning in VLMs, and provides an extensive evaluation highlighting their performance gaps.

Findings

1. VLMs perform better on 2D tasks than on 3D or text-based tasks.

2. Current VLMs struggle with complex spatial configurations.

3. Humans significantly outperform VLMs in spatial reasoning tasks.

Abstract

Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 3D, 2D, and text-based input modalities, enabling comprehensive assessments of VLMs' planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task's complexity and feasibility for humans. Results indicate that while VLMs perform better on 2D tasks compared to 3D or…
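
To make the notion of an "optimal path solution" concrete, the following is a minimal sketch of a shortest-path search on the classic sliding tile puzzle. It assumes the textbook formulation (a grid of numbered tiles with one blank, encoded as 0), not the specific variant used in iVISPAR, and uses plain breadth-first search; the function names `neighbors` and `optimal_path` are hypothetical and are not taken from the benchmark's repository.

```python
from collections import deque

# Directions the blank cell can move: (name, row delta, column delta).
MOVES = (("up", -1, 0), ("down", 1, 0), ("left", 0, -1), ("right", 0, 1))


def neighbors(state, width):
    """Yield (move, next_state) for every legal move of the blank (0)."""
    height = len(state) // width
    blank = state.index(0)
    row, col = divmod(blank, width)
    for name, d_row, d_col in MOVES:
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            target = r * width + c
            nxt = list(state)
            # Swap the blank with the adjacent tile.
            nxt[blank], nxt[target] = nxt[target], nxt[blank]
            yield name, tuple(nxt)


def optimal_path(start, goal, width):
    """Return a shortest list of blank moves from start to goal, or None."""
    start, goal = tuple(start), tuple(goal)
    if start == goal:
        return []
    # Maps each visited state to (previous state, move that reached it).
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for move, nxt in neighbors(state, width):
            if nxt in parent:
                continue
            parent[nxt] = (state, move)
            if nxt == goal:
                # Walk back to the start to recover the move sequence.
                path, cursor = [], nxt
                while parent[cursor] is not None:
                    cursor, step = parent[cursor]
                    path.append(step)
                return path[::-1]
            queue.append(nxt)
    return None  # goal is unreachable from start
```

For example, `optimal_path((1, 2, 3, 4, 5, 6, 0, 7, 8), (1, 2, 3, 4, 5, 6, 7, 8, 0), 3)` returns `["right", "right"]`. Because breadth-first search explores states in order of move count, the first path found is a shortest one; an agent's move count can then be compared against this length. For larger boards, A* with an admissible heuristic such as Manhattan distance is the usual substitute.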

Code & Models

Repositories

SharkyBamboozle/iVISPAR (official)

Videos

iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs

Taxonomy

Topics: Constraint Satisfaction and Optimization