ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng

TL;DR
ESPIRE is a new benchmark that evaluates embodied spatial reasoning in vision-language models using simulated robotic tasks, enabling detailed analysis of their spatial and action reasoning capabilities.
Contribution
The paper introduces ESPIRE, a comprehensive diagnostic benchmark for embodied spatial reasoning that combines simulation, task decomposition, and generative framing for VLM evaluation.
Findings
VLMs show varied spatial reasoning capabilities
ESPIRE reveals specific strengths and weaknesses of models
The benchmark facilitates targeted improvements in embodied spatial reasoning
Abstract
A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive…
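The abstract's split of each task into localization and execution suggests reporting success per stage instead of as a single number. The snippet below is a minimal sketch of that idea under stated assumptions; it is not ESPIRE's evaluation code. The `Episode` record, its two boolean fields, and the `stage_report` helper are hypothetical names introduced only for illustration, and the sample episodes are made-up inputs, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode outcome; field names are illustrative only.
    localized: bool  # the generated target location was judged correct
    executed: bool   # the generated action succeeded in simulation


def stage_report(episodes):
    """Aggregate per-stage success rates over a list of episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    loc = sum(e.localized for e in episodes)
    both = sum(e.localized and e.executed for e in episodes)
    return {
        "localization": loc / n,
        "end_to_end": both / n,
        # Execution success restricted to episodes that localized correctly,
        # which separates execution errors from localization errors.
        "execution_given_localization": both / loc if loc else 0.0,
    }


# Made-up outcomes, used only to exercise the function.
episodes = [
    Episode(True, True),
    Episode(True, False),
    Episode(False, False),
    Episode(True, True),
]
report = stage_report(episodes)
print(report)
```

Keeping the conditional rate separate from the end-to-end rate is one way to tell whether a model fails because it picks the wrong target or because it acts poorly on a correct one.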
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
