Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Yuming Gu; Yizhi Wang; Yining Hong; Yipeng Gao; Hao Jiang; Angtian Wang; Bo Liu; Nathaniel S. Dennler; Zhengfei Kuang; Hao Li; Gordon Wetzstein; Chongyang Ma

arXiv:2512.22626 · cs.CV · December 30, 2025

TL;DR

Envision is a diffusion-based visual planning framework for embodied agents that constrains video generation with an explicit goal image, producing coherent, goal-aligned trajectories for manipulation tasks with improved spatial consistency and physical plausibility.

Contribution

The paper presents a novel two-stage diffusion framework that explicitly incorporates goal images into visual planning, addressing spatial drift and goal misalignment issues in prior methods.

Findings

01. Achieves superior goal alignment and spatial consistency.

02. Produces physically plausible and smooth trajectories.

03. Enhances downstream robotic manipulation performance.

Abstract

Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis