MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models

Hafsteinn Einarsson

arXiv:2507.20395·cs.AI·July 29, 2025

TL;DR

MazeEval introduces a benchmark to evaluate spatial reasoning in language models through coordinate-based maze navigation tasks, revealing disparities in performance across models and languages, and highlighting limitations in current LLMs' spatial cognition.

Contribution

This paper presents MazeEval, a novel benchmark for assessing pure spatial reasoning in LLMs without visual input, and provides insights into models' cross-linguistic and complexity-related performance limitations.

Findings

01

OpenAI's O3 achieves perfect navigation on mazes up to 30x30.

02

Most models fail beyond 9x9 mazes, often looping excessively.

03

Spatial reasoning performance degrades when the task is posed in Icelandic, suggesting the capability depends on the language of the prompt rather than being language-agnostic.

Abstract

As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5 \times 5$ to $15 \times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight…
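
To make the setup described in the abstract concrete, here is a minimal sketch of what a coordinate-only maze tool could look like. It is an illustration under stated assumptions, not the benchmark's actual code: the class name `MazeEnv`, the `move` tool, the direction names, the passage-set maze representation, and the shape of the returned feedback are all hypothetical. The only parts taken from the abstract are that the model acts through a function call and receives coordinates plus distance-to-wall information, with no visual input.

```python
from collections import Counter

# Direction name -> (dx, dy), with y growing downward.
# These names and the axis convention are assumptions for illustration.
DIRECTIONS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


class MazeEnv:
    """Toy coordinate-only maze (hypothetical, not the benchmark's code)."""

    def __init__(self, passages, start, goal):
        # passages: set of frozenset({cell_a, cell_b}) for each pair of
        # adjacent cells with no wall between them.
        self.passages = passages
        self.pos = start
        self.goal = goal
        # Visit counts per cell, useful for spotting looping behaviour.
        self.visits = Counter({start: 1})

    def _can_step(self, cell, direction):
        dx, dy = DIRECTIONS[direction]
        nxt = (cell[0] + dx, cell[1] + dy)
        return frozenset({cell, nxt}) in self.passages

    def wall_distances(self):
        """Free steps available in each direction before a wall blocks the way."""
        out = {}
        for name, (dx, dy) in DIRECTIONS.items():
            cell, steps = self.pos, 0
            while self._can_step(cell, name):
                cell = (cell[0] + dx, cell[1] + dy)
                steps += 1
            out[name] = steps
        return out

    def move(self, direction):
        """The single tool a model would call; returns coordinate-only feedback."""
        moved = self._can_step(self.pos, direction)
        if moved:
            dx, dy = DIRECTIONS[direction]
            self.pos = (self.pos[0] + dx, self.pos[1] + dy)
            self.visits[self.pos] += 1
        return {
            "moved": moved,
            "position": self.pos,
            "wall_distances": self.wall_distances(),
            "at_goal": self.pos == self.goal,
        }
```

In a harness built along these lines, each model turn would be one `move` call, and the returned dictionary would be the only information the model sees about the maze. The `visits` counter lets an evaluator flag a run as looping once any cell's count passes a chosen threshold; the threshold itself is left open here because this page does not state one.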
