From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto G. Rodriguez Salgado

arXiv:2603.26839 · cs.LG · May 14, 2026

TL;DR

This paper introduces MazeBench, a benchmark revealing that large multimodal models rely on brute-force search strategies rather than genuine spatial reasoning, despite high accuracy scores.

Contribution

The study shows that models solve mazes by enumerating paths token by token, a procedure akin to breadth-first search (BFS), and highlights the gap between high accuracy and true visual spatial understanding.

Findings

1. Models translate images into text grids and enumerate paths step by step, consuming 1,710–22,818 tokens per solve.

2. High accuracy does not equate to human-like spatial reasoning.

3. MazeBench isolates visual extraction from search strategies, revealing reliance on enumeration.
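
To make the distinction between visual extraction and search concrete, the following is a minimal sketch of how the two stages could be scored separately. It is not MazeBench's actual scoring code: the character encoding (`#`, `.`, `S`, `E`), the move alphabet (`U`, `D`, `L`, `R`), and the function names `grid_cell_accuracy`, `find_cell`, and `path_is_valid` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding (not MazeBench's format):
# '#' wall, '.' open cell, 'S' start, 'E' exit.
Grid = List[str]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def grid_cell_accuracy(predicted: Grid, truth: Grid) -> float:
    """Stage 1 (extraction): fraction of cells transcribed correctly."""
    same_shape = len(predicted) == len(truth) and all(
        len(p) == len(t) for p, t in zip(predicted, truth)
    )
    total = sum(len(row) for row in truth)
    if not same_shape or total == 0:
        return 0.0  # assumption: a shape mismatch counts as a failed extraction
    correct = sum(
        p == t for p_row, t_row in zip(predicted, truth) for p, t in zip(p_row, t_row)
    )
    return correct / total


def find_cell(grid: Grid, target: str) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the first cell equal to target, or None."""
    for r, row in enumerate(grid):
        c = row.find(target)
        if c != -1:
            return (r, c)
    return None


def path_is_valid(grid: Grid, moves: str) -> bool:
    """Stage 2 (search): does the move string walk from 'S' to 'E' without hitting a wall?"""
    pos = find_cell(grid, "S")
    if pos is None:
        return False
    for move in moves:
        if move not in MOVES:
            return False
        r, c = pos[0] + MOVES[move][0], pos[1] + MOVES[move][1]
        in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[r])
        if not in_bounds or grid[r][c] == "#":
            return False
        pos = (r, c)
    return grid[pos[0]][pos[1]] == "E"


truth = ["S.#", "#.#", "#.E"]
misread = ["S.#", "#.#", "#.#"]  # one cell transcribed incorrectly
print(grid_cell_accuracy(misread, truth))
print(path_is_valid(truth, "RDDR"), path_is_valid(misread, "RDDR"))
```

Scoring the stages separately shows where a failure originates: in this toy case a single misread cell is enough to make an otherwise correct path invalid on the transcribed grid.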

Abstract

How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20 × 20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from…
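
For readers unfamiliar with the search procedure the abstract compares model traces to, here is a minimal sketch of breadth-first search on a text grid. It illustrates the classical algorithm only; it is not the paper's code, and it does not claim to reproduce what any model does internally. The grid encoding (`#` wall, `S` start, `E` exit) and the function name `bfs_shortest_path` are assumptions chosen for this example.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def bfs_shortest_path(grid: List[str]) -> Optional[List[Cell]]:
    """Return a shortest list of (row, col) cells from 'S' to 'E', or None if unreachable.

    Hypothetical encoding: '#' is a wall; every other character is walkable.
    """
    start = next(
        ((r, row.index("S")) for r, row in enumerate(grid) if "S" in row), None
    )
    if start is None:
        return None

    # parent doubles as the visited set and as back-pointers for path recovery.
    parent: Dict[Cell, Optional[Cell]] = {start: None}
    queue = deque([start])

    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "E":
            # Walk the back-pointers from the exit to the start, then reverse.
            path: List[Cell] = []
            node: Optional[Cell] = (r, c)
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        # Expand the four orthogonal neighbours in a fixed order.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < len(grid)
                and 0 <= nc < len(grid[nr])
                and grid[nr][nc] != "#"
                and (nr, nc) not in parent
            ):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None


print(bfs_shortest_path(["S.#", "#.#", "#.E"]))
```

BFS visits cells in order of distance from the start, so the number of expansions grows with the number of reachable cells. Written out in prose, each expansion costs tokens, which is consistent with the abstract's observation that larger mazes exhaust token limits.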

Code & Models

Datasets

albertoRodriguez97/MazeBench
dataset · 126 downloads
