Image Generation as a Visual Planner for Robotic Manipulation

Ye Pang

arXiv:2512.00532 · cs.CV · December 2, 2025

TL;DR

This paper demonstrates that pretrained image generation models can be adapted to serve as visual planners for robotic manipulation, producing coherent videos from minimal supervision and without explicit temporal training.

Contribution

The author introduces a two-part framework that adapts pretrained image generators for robotic visual planning through LoRA finetuning, and shows that these models encode temporal priors that transfer to generating robotic manipulation videos.

Findings

01. Pretrained image generators can produce smooth, coherent robotic videos.

02. Minimal supervision with LoRA finetuning suffices for adaptation.

03. Models generalize across multiple robotic datasets.

Abstract

Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play…
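The "grid image" idea in the abstract can be illustrated with a small sketch: a short clip is tiled into one still image so that an image generator can produce all frames at once, and the output is then cut back into frames. The code below is an assumption-laden illustration, not the paper's implementation. The row-major layout, the function names `frames_to_grid` and `grid_to_frames`, and the nested-list frame representation are all hypothetical choices made to keep the example dependency-free.

```python
from typing import List

# Hypothetical representation: one frame is a list of pixel rows.
Frame = List[List[int]]


def frames_to_grid(frames: List[Frame], cols: int) -> Frame:
    """Tile equally sized frames into one image, left to right, top to bottom."""
    if not frames or cols <= 0 or len(frames) % cols != 0:
        raise ValueError("need a non-empty frame list whose length is a multiple of cols")
    height, width = len(frames[0]), len(frames[0][0])
    if any(len(f) != height or any(len(r) != width for r in f) for f in frames):
        raise ValueError("all frames must share the same size")
    grid: Frame = []
    for start in range(0, len(frames), cols):
        band = frames[start:start + cols]  # frames forming one row of the grid
        for y in range(height):
            row: List[int] = []
            for frame in band:
                row.extend(frame[y])  # stitch pixel row y of each frame side by side
            grid.append(row)
    return grid


def grid_to_frames(grid: Frame, rows: int, cols: int) -> List[Frame]:
    """Cut a grid image back into rows * cols frames, in the same row-major order."""
    if not grid or rows <= 0 or cols <= 0:
        raise ValueError("grid must be non-empty and rows/cols positive")
    if len(grid) % rows != 0 or len(grid[0]) % cols != 0:
        raise ValueError("grid size must divide evenly into rows x cols cells")
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames: List[Frame] = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                grid[r * height + y][c * width:(c + 1) * width]
                for y in range(height)
            ]
            frames.append(cell)
    return frames
```

Under these assumptions, the first cell of the grid would hold the observed initial frame and the remaining cells would be filled in by the generator. Reading the cells back in order yields the planned video.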

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning