MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
Hafsteinn Einarsson

TL;DR
MazeEval introduces a benchmark to evaluate spatial reasoning in language models through coordinate-based maze navigation tasks, revealing disparities in performance across models and languages, and highlighting limitations in current LLMs' spatial cognition.
Contribution
This paper presents MazeEval, a novel benchmark for assessing pure spatial reasoning in LLMs without visual input, and provides insights into models' cross-linguistic and complexity-related performance limitations.
Findings
OpenAI's O3 achieves perfect navigation up to 30x30 mazes.
Most models fail beyond 9x9 mazes, often looping excessively.
Spatial reasoning performance degrades in Icelandic, indicating language-dependent learning.
Abstract
As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ( to grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight…
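The function-calling interface described in the abstract can be pictured with a small sketch. This is not the benchmark's actual code: the `MazeEnv` class, the `move` function name, the `'#'`/`'.'` grid encoding, and the shape of the returned feedback are all assumptions chosen for illustration. It shows only the general idea that the model calls a movement function and receives its coordinates plus the distance to the nearest wall in each direction, with no visual rendering of the maze.

```python
from dataclasses import dataclass

# Direction name -> (dx, dy); y grows downward. Names are illustrative.
DIRECTIONS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


@dataclass
class MazeEnv:
    """Hypothetical text-only maze: '#' is a wall, '.' is an open cell."""
    grid: list[str]
    pos: tuple[int, int]   # current (x, y)
    goal: tuple[int, int]
    steps: int = 0

    def _open(self, x: int, y: int) -> bool:
        # A cell is walkable if it lies inside the grid and is not a wall.
        inside = 0 <= y < len(self.grid) and 0 <= x < len(self.grid[y])
        return inside and self.grid[y][x] != "#"

    def _distances(self) -> dict[str, int]:
        # Count how many open cells lie ahead in each direction.
        out = {}
        for name, (dx, dy) in DIRECTIONS.items():
            x, y = self.pos
            d = 0
            while self._open(x + dx, y + dy):
                x, y = x + dx, y + dy
                d += 1
            out[name] = d
        return out

    def move(self, direction: str) -> dict:
        """The single tool a model would call; returns coordinate feedback only."""
        dx, dy = DIRECTIONS[direction]
        x, y = self.pos
        moved = self._open(x + dx, y + dy)
        if moved:
            self.pos = (x + dx, y + dy)
        self.steps += 1  # blocked attempts still count as steps
        return {
            "moved": moved,
            "position": self.pos,
            "distance_to_wall": self._distances(),
            "at_goal": self.pos == self.goal,
        }
```

Under these assumptions, an evaluation loop would pass each returned dictionary back to the model as the tool result and stop when `at_goal` is true or a step budget is exhausted. Counting blocked attempts as steps is one way to make excessive looping visible in the step count.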
