Video Generators are Robot Policies

Junbang Liang; Pavel Tokmakov; Ruoshi Liu; Sruthi Sudhakar; Paarth Shah; Rares Ambrus; Carl Vondrick

arXiv:2508.00795 · cs.RO · August 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Video Policy, a framework that uses video generation to learn robot policies, improving generalization and reducing data requirements by leveraging large-scale video models.

Contribution

The paper presents a novel end-to-end video and action generation framework for robot policy learning, enhancing robustness and sample efficiency with minimal demonstration data.

Findings

01 · Improved generalization to unseen objects and tasks.

02 · Enhanced robustness and sample efficiency in policy learning.

03 · Superior performance over traditional behavior cloning.

Abstract

Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation and can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and the real world. We further highlight that task success is…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

**Strong reported performance on benchmarks**: The paper reports high success rates on the RoboCasa (63% average) and Libero10 (94% average) benchmarks, outperforming several baselines, including some large-scale Vision-Language-Action (VLA) models, while using significantly less demonstration data. These results provide initial evidence for the potential of leveraging powerful video priors.

**Good video generation quality**: The authors give a good demonstration of the capability of the vide

Weaknesses

**Overstated Novelty and Lack of Meaningful Comparison to Concurrent Work:** The paper frames the use of video models for policy learning as a novel contribution. This is a significant overstatement. The 2024-2025 period has seen a proliferation of work on this exact topic, including but not limited to concurrent models like **UVA** [1], **UWM** [2], and **VPP** [3], which explore unified video-action architectures. The paper fails to properly situate itself within this crowded landscape, and it

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is well-written. The introduction and method are easy to follow.
- The experiments are very comprehensive to support the main claims in the paper.
- The video prediction results look similar to the real world.
- The results show that Video Policy is superior to baselines.
- Using action-free data provides a potentially scalable way for data-driven robot learning.

Weaknesses

- The computation cost is high. The inference speed of the video model could slow down the policy rollout. Can the authors provide a comparison between Video Policy and other policy baselines? Can the authors propose some ways to accelerate the policy FPS?
- It remains unclear if Video Policy still performs well in tasks with higher dynamics. The video model may show physically implausible results. It would be interesting if the authors can explore the behaviour of the policy and the video prediction mode

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Strong empirical results, especially on RoboCasa and Libero10, showing improved success rates over baselines with a compact architecture.
2. Clarity and reproducibility: the paper is clearly written, with transparent architecture figures, ablations, and hyperparameter details.

Weaknesses

1. While the video diffusion model can be pretrained without actions, the action model still requires action-labeled data for fine-tuning. Therefore, the framework does not eliminate the need for action supervision. The title and framing (“action-free video learning”) are somewhat overstated.
2. The action decoder is trained on limited demonstrations and does not inherit the generalization ability of the video diffusion model. As a result, its performance still depends heavily on the diversity a

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis