Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić

TL;DR
This paper introduces Visual Planning, a new paradigm that uses sequences of images for reasoning in visual tasks, outperforming text-based methods and offering a more intuitive approach for spatial and geometrical reasoning.
Contribution
The paper proposes a novel visual planning paradigm and a reinforcement learning framework, VPRL, enabling reasoning through images, which enhances performance on visual navigation tasks.
Findings
Visual Planning outperforms text-only reasoning methods.
The VPRL framework significantly improves planning in visual navigation tasks.
Visual representations provide a more natural reasoning modality for spatial tasks.
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a…
Peer Reviews
Decision: ICLR 2026 Oral
1. Generating images moves the reasoning into the visual space rather than the textual space, so the proposed method has the potential for more direct and more accurate reasoning.
2. Through qualitative analysis of intermediate outputs, the paper shows that the proposed method can generate reasonable intermediate images, which is key for correct visual reasoning.
3. The experimental results demonstrate that the proposed method outperforms strong baselines, including proprietary MLLMs.
1. It can be observed that the intermediate images are not perfect (e.g., in Fig. 3, first row, the player and goal tokens have artifacts). Thus, it would be interesting if the paper could show performance when the model reasons over high-quality images. For example, each time the model generates a new image, the corresponding high-quality image (rendered by the engine rather than generated by the model itself) is fed into the model. Would this lead to better performance? If so, the performance
1. New Paradigm: The paper introduces "Visual Planning" as a genuinely new paradigm for reasoning.
2. Good Empirical Results: The proposed method, Visual Planning via Reinforcement Learning (VPRL), significantly outperforms a wide range of baselines.
3. Methodological Robustness: The two-stage VPRL framework is well-designed and justified.
1. Reliance on an External Oracle for Rewards: A significant weakness in the method's detail is its reliance on non-learned, external modules to provide the reward signal. The VPRL framework depends on a "dynamics interpreter" and a "progress estimator". The appendix reveals this estimator is a Breadth-First Search (BFS) algorithm, an oracle that has already solved the task and knows the optimal path from every state. The interpreter also uses rule-based pixel and IoU comparisons. This means the
The authors investigate the potential of visual representations as a reasoning medium, which extends LLM research to a broader area. The paper is well presented, with clear statements and appropriate figures. It is the first attempt to investigate whether models can plan purely through visual representations.
Can you provide figures that clearly show the difference between language as a medium and vision as a medium in specific cases? It would be better to discuss the advantages of vision as a medium in real CV tasks, such as visual grounding. Can the proposed methods be easily transferred to 3D? If the end goal is an MLLM, how should GRPO be added to the regular training recipe? When should the visual modality be aligned with other modalities?
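One of the reviews above notes that the progress estimator is a BFS oracle that knows the distance to the goal from every state. The following is a minimal sketch of that idea, not the authors' implementation: the grid encoding (`#` for walls), the function names `bfs_distances` and `progress`, and the convention of returning the change in distance are all assumptions made for illustration.

```python
from collections import deque

def bfs_distances(grid, goal):
    """Shortest step count from every reachable free cell to `goal`.

    `grid` is a list of equal-length strings; '#' marks a wall (assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def progress(dist, state, next_state):
    """Hypothetical progress signal: reduction in BFS distance to the goal.

    Returns None when `next_state` is a wall or otherwise unreachable,
    which a reward function could treat as an invalid transition.
    """
    if next_state not in dist:
        return None
    return dist[state] - dist[next_state]

# Toy 3x3 maze: start at (0, 0), goal at (2, 2).
GRID = [
    "..#",
    ".##",
    "...",
]
DIST = bfs_distances(GRID, goal=(2, 2))
```

In this toy maze, moving from `(0, 0)` down to `(1, 0)` yields a progress of 1, moving right to `(0, 1)` yields -1, and stepping into the wall at `(1, 1)` yields `None`. Because the distances come from an exhaustive search of the environment, such a signal is only available where the full state space can be enumerated, which is the limitation the review raises.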
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · AI-based Problem Solving and Planning
