Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
Weijiang Li, Yilin Zhu, Rajarshi Das, Parijat Dube

TL;DR
This paper uses maze tasks to test whether large language models build internal spatial world models, and finds that their spatial reasoning accuracy depends heavily on how the maze is represented.
Contribution
It provides systematic evidence that LLMs' spatial reasoning is representation-dependent and not indicative of true internal spatial world models.
Findings
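With chain-of-thought prompting, Gemini-2.5-Flash achieves 80-86% accuracy on small mazes (5x5 to 7x7 grids) given tokenized adjacency representations.
On the same maze sizes, accuracy drops to 16-34% with visual grid formats, a 2-5x gap that points to format-dependent reasoning.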
Models fail to leverage semantic understanding for consistent spatial computations.
Abstract
Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial…
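To make the contrast between the two input formats concrete, here is a minimal Python sketch that serializes one small maze both as adjacency text and as a character grid, and computes a ground-truth path with breadth-first search. The maze, the `<->` token syntax, the `#`/`.` grid symbols, and all function names are illustrative assumptions; the paper's actual prompt formats and scoring are not given in this summary.

```python
from collections import deque

# Hypothetical 3x3 maze, given as open passages between (row, col) cells.
PASSAGES = [
    ((0, 0), (0, 1)), ((0, 1), (0, 2)), ((0, 2), (1, 2)),
    ((1, 2), (1, 1)), ((1, 1), (1, 0)), ((1, 0), (2, 0)),
    ((2, 0), (2, 1)), ((2, 1), (2, 2)),
]

def adjacency_text(passages):
    """Adjacency-style serialization: one 'a <-> b' line per passage (assumed syntax)."""
    return "\n".join(f"{a} <-> {b}" for a, b in passages)

def grid_text(rows, cols, passages):
    """Visual serialization: '#' is wall, '.' is open floor (assumed symbols)."""
    canvas = [["#"] * (2 * cols + 1) for _ in range(2 * rows + 1)]
    for r in range(rows):
        for c in range(cols):
            canvas[2 * r + 1][2 * c + 1] = "."       # the cell itself
    for (r1, c1), (r2, c2) in passages:
        canvas[r1 + r2 + 1][c1 + c2 + 1] = "."       # gap between two adjacent cells
    return "\n".join("".join(row) for row in canvas)

def shortest_path(passages, start, goal):
    """Breadth-first search over the passage graph; returns a list of cells or None."""
    neighbours = {}
    for a, b in passages:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for nxt in neighbours.get(cell, []):
            if nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None
```

Both strings describe the same graph, so a format-invariant solver would return the same path for either one. In an evaluation along these lines, `shortest_path` could serve as the reference answer against which a model's output for each format is compared.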
