Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel; Shraddhaa Mohan; Hanlin Mai; Unnat Jain; Svetlana Lazebnik; Yunzhu Li

arXiv:2507.00990·cs.RO·May 14, 2026

TL;DR

This paper presents RIGVid, a system enabling robots to learn complex manipulation tasks by imitating AI-generated videos filtered by vision-language models, eliminating the need for physical demonstrations.

Contribution

The work uses AI-generated videos as the sole source of supervision for robotic manipulation, so that neither physical demonstrations nor robot-specific training are required.

Findings

01. Filtered generated videos are as effective as real demonstrations.

02. Performance improves with higher quality video generation.

03. Generated videos outperform keypoint prediction methods.

Abstract

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks (such as pouring, wiping, and mixing) purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more…
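The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the retry limit, and the rigid-grasp assumption are all hypothetical choices made for illustration. The first function stands in for the generate-then-filter step, with the video model and the VLM check passed in as plain callables. The second shows one common way to retarget a tracked 6D object trajectory to a gripper, assuming the object does not move relative to the gripper once grasped and that all poses are 4x4 homogeneous transforms in a shared world frame.

```python
from typing import Callable, Optional, Sequence, TypeVar

import numpy as np

Video = TypeVar("Video")


def first_accepted_video(
    generate: Callable[[], Video],
    follows_command: Callable[[Video], bool],
    max_attempts: int = 5,  # hypothetical retry budget, not from the paper
) -> Optional[Video]:
    """Sample videos until the filter accepts one, or give up.

    `generate` stands in for a video diffusion model and `follows_command`
    for a VLM-based check; both are placeholders supplied by the caller.
    """
    for _ in range(max_attempts):
        video = generate()
        if follows_command(video):
            return video
    return None


def retarget_object_trajectory(
    object_poses: Sequence[np.ndarray],
    gripper_pose_at_grasp: np.ndarray,
) -> list:
    """Map tracked object poses to gripper poses via a fixed grasp transform.

    Assumption: the grasp is rigid, so the gripper pose expressed in the
    object frame stays constant. object_poses[0] is the object pose at the
    moment of grasping.
    """
    # Gripper pose in the object frame, computed once at grasp time.
    object_to_gripper = np.linalg.inv(object_poses[0]) @ gripper_pose_at_grasp
    # Apply the same offset to every tracked object pose.
    return [pose @ object_to_gripper for pose in object_poses]
```

Because the retargeting step only needs the object trajectory and one gripper pose at grasp time, nothing in it depends on the robot's kinematics, which is one way to read "embodiment-agnostic" here. Turning the resulting gripper poses into joint commands would be left to whatever motion planner or inverse-kinematics solver the robot already uses.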

Videos

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations · SlidesLive