Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion
Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang, Angtian Wang, Bo Liu, Nathaniel S. Dennler, Zhengfei Kuang, Hao Li, Gordon Wetzstein, Chongyang Ma

TL;DR
Envision introduces a diffusion-based visual planning framework for embodied agents that explicitly models goals to generate coherent, goal-aligned trajectories for manipulation tasks, improving spatial consistency and physical plausibility.
Contribution
The paper presents a novel two-stage diffusion framework that explicitly incorporates goal images into visual planning, addressing spatial drift and goal misalignment issues in prior methods.
Findings
Achieves superior goal alignment and spatial consistency.
Produces physically plausible and smooth trajectories.
Enhances downstream robotic manipulation performance.
Abstract
Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies…
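The contrast the abstract draws between forward-only prediction and goal-constrained generation can be illustrated with a toy one-dimensional analogy. The sketch below is not Envision's method and involves no diffusion model: it assumes a scalar "scene state", Gaussian per-step prediction error, and a simple linear correction that pins the final state to the goal. The names `forward_rollout` and `goal_constrained` are hypothetical, chosen for illustration only.

```python
import random


def forward_rollout(start, goal, steps, noise, rng):
    """Forward-only rollout: each state is predicted from the previous one,
    so per-step errors accumulate and the endpoint can miss the goal."""
    step_size = (goal - start) / steps
    states = [start]
    for _ in range(steps):
        states.append(states[-1] + step_size + rng.gauss(0.0, noise))
    return states


def goal_constrained(rollout, goal):
    """Toy goal constraint (an assumption, not the paper's mechanism):
    spread the endpoint error linearly over time so the last state
    coincides with the goal while the first state is unchanged."""
    steps = len(rollout) - 1
    end_error = rollout[-1] - goal
    return [x - (t / steps) * end_error for t, x in enumerate(rollout)]


rng = random.Random(0)
free = forward_rollout(0.0, 1.0, steps=16, noise=0.1, rng=rng)
pinned = goal_constrained(free, goal=1.0)

# Endpoint error of the forward-only rollout vs. the goal-pinned one.
print(abs(free[-1] - 1.0), abs(pinned[-1] - 1.0))
```

In this toy setting the forward-only endpoint error grows with the number of steps and the noise level, while the pinned trajectory reaches the goal by construction. It only illustrates why conditioning on both endpoints removes endpoint drift, under the stated assumptions.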
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
