TL;DR
This paper introduces EAR, an editing-as-reasoning paradigm that recasts visual planning as a single-step image transformation, and evaluates models on AMAZE, a new dataset of abstract puzzles.
Contribution
The work proposes EAR, an editing-as-reasoning paradigm, and introduces AMAZE, a procedurally generated dataset for probing visual planning in models.
Findings
Models struggle in zero-shot settings on AMAZE tasks.
Fine-tuning improves generalization to larger and out-of-domain puzzles.
Even the best models lag behind human efficiency in visual reasoning.
Abstract
Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic…
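To make the two puzzle families concrete, here is a minimal Python sketch of a procedurally generated maze and of rule-based checks for Maze and Queen answers. It is not the AMAZE generator or the paper's evaluation code. The grid encoding, the randomized depth-first carving, the reading of "Queen" as the classical N-Queens constraint, and all function names (`generate_maze`, `shortest_path`, `is_valid_path`, `is_valid_queens`) are assumptions made for illustration. The paper works with images, so a symbolic grid stands in for the rendered puzzle here.

```python
import random
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def generate_maze(n, seed=0):
    """Carve an n x n cell maze with randomized depth-first search.

    Returns a (2n+1) x (2n+1) grid where 1 is a wall and 0 is free.
    Cell (r, c) sits at grid position (2r+1, 2c+1).
    """
    rng = random.Random(seed)
    size = 2 * n + 1
    grid = [[1] * size for _ in range(size)]
    grid[1][1] = 0
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        options = [(r + dr, c + dc) for dr, dc in STEPS
                   if 0 <= r + dr < n and 0 <= c + dc < n
                   and (r + dr, c + dc) not in visited]
        if not options:
            stack.pop()  # dead end: backtrack
            continue
        nr, nc = rng.choice(options)
        grid[r + nr + 1][c + nc + 1] = 0    # open the wall between the two cells
        grid[2 * nr + 1][2 * nc + 1] = 0    # open the new cell
        visited.add((nr, nc))
        stack.append((nr, nc))
    return grid

def shortest_path(grid, start, goal):
    """Breadth-first search over free grid positions; returns a list of positions or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for dr, dc in STEPS:
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in prev):
                prev[nxt] = cur
                queue.append(nxt)
    return None

def is_valid_path(grid, path, start, goal):
    """A path is valid if it runs from start to goal through free, adjacent positions."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    if any(grid[r][c] != 0 for r, c in path):
        return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))

def is_valid_queens(cols):
    """cols[r] is the column of the queen in row r; no shared column or diagonal allowed."""
    n = len(cols)
    return (sorted(cols) == list(range(n))
            and len({r + c for r, c in enumerate(cols)}) == n
            and len({r - c for r, c in enumerate(cols)}) == n)
```

For example, `generate_maze(4, seed=1)` yields a 9 x 9 grid, and `is_valid_path` can check any proposed route from `(1, 1)` to `(7, 7)` without a human in the loop. Under this sketch's assumptions, a model's edited image would first have to be decoded back into such a path or queen placement before the check applies.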
