From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Alberto G. Rodriguez Salgado

TL;DR
This paper introduces MazeBench, a benchmark revealing that large multimodal models rely on brute-force search strategies rather than genuine spatial reasoning, despite high accuracy scores.
Contribution
The study shows that models solve mazes by enumerating paths token by token, an approach akin to BFS, which exposes the gap between high accuracy and genuine visual spatial understanding.
Findings
Models translate images into text grids and enumerate paths, consuming many tokens.
High accuracy does not equate to human-like spatial reasoning.
MazeBench isolates visual extraction from search strategies, revealing reliance on enumeration.
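The second stage described above, search over a text grid, can be illustrated with a minimal sketch. This is not the paper's code or any model's actual procedure: the grid encoding (`S` for start, `G` for goal, `#` for walls, `.` for open cells) and the function name `bfs_shortest_path` are assumptions made here for illustration. It only shows what a breadth-first enumeration looks like once a maze image has already been translated into text.

```python
from collections import deque


def bfs_shortest_path(grid, start="S", goal="G", wall="#"):
    """Return the shortest path as a list of (row, col) cells, or None.

    The character encoding of the grid is a hypothetical convention,
    not one taken from MazeBench.
    """
    rows = [list(line) for line in grid]

    # Locate the start and goal cells.
    marks = {}
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch in (start, goal):
                marks[ch] = (r, c)
    if start not in marks or goal not in marks:
        return None
    src, dst = marks[start], marks[goal]

    # Standard BFS: expand cells in order of distance from the start,
    # remembering each cell's predecessor so the path can be rebuilt.
    parent = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(rows) and 0 <= nc < len(rows[nr])
            if inside and rows[nr][nc] != wall and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# A tiny hand-made maze (illustrative only).
MAZE = [
    "S.#",
    ".##",
    "..G",
]
print(bfs_shortest_path(MAZE))
```

A program performs this enumeration over a queue of cells; the paper's claim is that models carry out a comparable enumeration step by step in generated text, which is why token counts grow with maze size.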
Abstract
How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from…
