Image Generation as a Visual Planner for Robotic Manipulation
Ye Pang

TL;DR
This paper shows that pretrained image generation models, lightly adapted with LoRA finetuning, can serve as visual planners for robotic manipulation, producing temporally coherent videos without explicit temporal modeling.
Contribution
The paper introduces a two-part framework (text-conditioned and trajectory-conditioned generation) that adapts image generators for robotic planning, showing that they encode transferable temporal priors useful for generating robotic videos.
Findings
Pretrained image generators can produce smooth, coherent robotic videos.
Minimal supervision with LoRA finetuning suffices for adaptation.
Models generalize across multiple robotic datasets.
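To make the LoRA finding concrete, here is a minimal NumPy sketch of the standard low-rank update, in which a frozen weight W is augmented by (alpha / r) * B @ A and only A and B are trained. The layer sizes, the rank, and the function name `lora_forward` are illustrative assumptions; the summary does not state which layers or ranks the paper uses.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Frozen linear layer W plus a low-rank update scaled by alpha / r."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 4, 8.0      # hypothetical sizes, not from the paper
W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init so the update starts at zero

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, r)

trainable_params = A.size + B.size           # parameters updated during finetuning
frozen_params = W.size                       # parameters left untouched
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and in this toy setting the trainable parameters are a small fraction of the frozen ones.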
Abstract
Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play…
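The grid-image idea and the trajectory-conditioned input can be illustrated with a small NumPy sketch. It assumes a row-major tiling of frames into a single image and a trajectory drawn as small colored squares on the first frame; the abstract does not specify the grid layout or the overlay style, and the helper names `frames_to_grid`, `grid_to_frames`, and `overlay_trajectory` are hypothetical.

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile a (T, H, W, C) frame stack into one (rows*H, cols*W, C) image, row-major."""
    t, h, w, c = frames.shape
    assert t == rows * cols, "frame count must fill the grid"
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)

def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: split a grid image back into a frame stack."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)

def overlay_trajectory(frame, points, color=(255, 0, 0), radius=1):
    """Draw 2D waypoints given as (x, y) pixel coordinates on a copy of the frame."""
    out = frame.copy()
    h, w = out.shape[:2]
    for x, y in points:
        x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        out[y0:y1, x0:x1] = color
    return out
```

Under these assumptions, a generated grid image is turned back into a short video with `grid_to_frames`, while `overlay_trajectory` prepares the conditioning image for the trajectory-conditioned variant.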
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
