Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

Lars Benedikt Kaesberg; Tianyu Yang; Niklas Bauer; Terry Ruas; Jan Philip Wahle; Bela Gipp

arXiv:2604.09338 · cs.AI · April 13, 2026

TL;DR

Spatial-Gym is a Gymnasium environment that evaluates spatial reasoning in agents through step-by-step pathfinding in 2D-grid puzzles. Even the best evaluated model solves far fewer puzzles than humans do.

Contribution

Introduces Spatial-Gym, a sequential decision environment for spatial constraint reasoning, and evaluates eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes.

Findings

01

Models perform far worse than humans: the best model, GPT-OSS 120B, solves 16.0% of puzzles, 82 points below the human baseline of 98.0%.

02

The step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger ones (up to 5.6%).

03

Vision-based inputs drastically reduce model success rates.

Abstract

Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by…
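To make the step-by-step setting with optional backtracking more concrete, here is a minimal Python sketch. It is not the Spatial-Gym API: the `ToyGridEnv` class, the 4x4 grid, the `"back"` action, and the rule that illegal moves and revisits are ignored are all assumptions made for illustration. The sketch runs a random agent for 500 episodes, computes its solve rate, and uses a plain A* search with a Manhattan-distance heuristic as a reference for the shortest path.

```python
import heapq
import random

# Hypothetical toy grid: 0 = free cell, 1 = wall. Not a Spatial-Gym puzzle.
GRID = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
START, GOAL = (0, 0), (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class ToyGridEnv:
    """Gymnasium-style reset/step interface, written from scratch for illustration."""

    def __init__(self, grid, start, goal, max_steps=50):
        self.grid, self.start, self.goal, self.max_steps = grid, start, goal, max_steps

    def reset(self):
        self.path = [self.start]  # cells visited so far, in order
        self.steps = 0
        return self.path[-1]

    def legal(self, cell):
        r, c = cell
        return 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0]) and self.grid[r][c] == 0

    def step(self, action):
        """Apply one move or "back"; return (position, solved, truncated)."""
        self.steps += 1
        if action == "back":
            if len(self.path) > 1:
                self.path.pop()  # undo the most recent move
        else:
            dr, dc = MOVES[action]
            nxt = (self.path[-1][0] + dr, self.path[-1][1] + dc)
            # Illegal moves and revisits are ignored (an assumption of this sketch).
            if self.legal(nxt) and nxt not in self.path:
                self.path.append(nxt)
        solved = self.path[-1] == self.goal
        return self.path[-1], solved, self.steps >= self.max_steps


def run_random_episode(env, rng, allow_back=True):
    """Play one episode with uniformly random actions; return True if solved."""
    env.reset()
    actions = list(MOVES) + (["back"] if allow_back else [])
    while True:
        _, solved, truncated = env.step(rng.choice(actions))
        if solved or truncated:
            return solved


def astar_length(env):
    """Shortest number of moves from start to goal, or None if unreachable."""
    def h(cell):  # Manhattan distance, admissible on a 4-connected grid
        return abs(cell[0] - env.goal[0]) + abs(cell[1] - env.goal[1])

    frontier = [(h(env.start), 0, env.start)]
    best = {env.start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == env.goal:
            return cost
        if cost > best[cell]:
            continue  # stale queue entry
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if env.legal(nxt) and cost + 1 < best.get(nxt, float("inf")):
                best[nxt] = cost + 1
                heapq.heappush(frontier, (cost + 1 + h(nxt), cost + 1, nxt))
    return None


env = ToyGridEnv(GRID, START, GOAL)
rng = random.Random(0)
episodes = 500
solved_count = sum(run_random_episode(env, rng) for _ in range(episodes))
solve_rate = 100.0 * solved_count / episodes  # percentage of solved episodes
print(f"A* shortest path: {astar_length(env)} moves")
print(f"random agent solve rate: {solve_rate:.1f}%")
```

In a one-shot setting the agent would instead emit the whole action sequence in a single response. The same solve-rate arithmetic underlies the gap quoted in the abstract: 98.0 - 16.0 = 82.0 points.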
