Probing Visual Planning in Image Editing Models

Zhimu Zhou; Yanpeng Zhao; Qiuyu Liao; Bo Zhao; Xiaojian Ma

arXiv:2604.22868·cs.CV·April 28, 2026

TL;DR

This paper introduces EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation, and evaluates image editing models on AMAZE, a new procedurally generated dataset of abstract Maze and Queen puzzles.

Contribution

The work proposes EAR, an editing-as-reasoning paradigm, and introduces AMAZE, a procedurally generated dataset for probing visual planning in models.

Findings

01. Models struggle in zero-shot settings on AMAZE tasks.

02. Fine-tuning improves generalization to larger and out-of-domain puzzles.

03. Even the best models lag behind human efficiency in visual reasoning.

Abstract

Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic…
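To make the idea of procedurally generated puzzles with automatic checking concrete, here is a minimal Python sketch. It is not the authors' generator or evaluator: the function names (`generate_maze`, `shortest_path`, `is_valid_path`, `is_valid_queens`), the randomized depth-first carving, and the use of the standard N-Queens constraints are all assumptions made for illustration, and AMAZE's actual puzzle formats and metrics may differ.

```python
import random
from collections import deque


def generate_maze(n, seed=0):
    """Carve a perfect maze on an n x n grid with randomized depth-first search.

    Returns a dict mapping each cell (row, col) to the set of adjacent cells
    it is connected to. Hypothetical format, not the AMAZE one.
    """
    rng = random.Random(seed)
    passages = {(r, c): set() for r in range(n) for c in range(n)}
    stack = [(0, 0)]
    visited = {(0, 0)}
    while stack:
        r, c = stack[-1]
        options = [
            (r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (r + dr, c + dc) in passages and (r + dr, c + dc) not in visited
        ]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(options)
        passages[(r, c)].add(nxt)  # open the wall in both directions
        passages[nxt].add((r, c))
        visited.add(nxt)
        stack.append(nxt)
    return passages


def shortest_path(passages, start, goal):
    """Breadth-first search; returns the list of cells from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for nb in passages[cell]:
            if nb not in parent:
                parent[nb] = cell
                queue.append(nb)
    if goal not in parent:
        return None
    path = []
    cell = goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


def is_valid_path(passages, path, start, goal):
    """Check that a proposed path starts and ends correctly and never crosses a wall."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    return all(b in passages[a] for a, b in zip(path, path[1:]))


def is_valid_queens(cols):
    """cols[r] is the column of the queen in row r (standard N-Queens rules assumed)."""
    n = len(cols)
    if sorted(cols) != list(range(n)):  # exactly one queen per column
        return False
    return all(
        abs(cols[i] - cols[j]) != j - i  # no two queens share a diagonal
        for i in range(n)
        for j in range(i + 1, n)
    )
```

A model's output image would first have to be parsed back into a path or a queen placement before checks like these apply; that parsing step is outside this sketch.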

Code & Models

Repositories

spatigen/amaze (GitHub)
