Video Generators are Robot Policies
Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, Carl Vondrick

TL;DR
This paper introduces Video Policy, a framework that uses video generation to learn robot policies, improving generalization and reducing data requirements by leveraging large-scale video models.
Contribution
The paper presents a novel end-to-end video and action generation framework for robot policy learning, enhancing robustness and sample efficiency with minimal demonstration data.
Findings
Improved generalization to unseen objects and tasks.
Enhanced robustness and sample efficiency in policy learning.
Superior performance over traditional behavior cloning.
Abstract
Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation and can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and the real world. We further highlight that task success is…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
**Strong reported performance on benchmarks**: The paper reports high success rates on the RoboCasa (63% average) and Libero10 (94% average) benchmarks, outperforming several baselines, including some large-scale Vision-Language-Action (VLA) models, while using significantly less demonstration data. These results provide initial evidence for the potential of leveraging powerful video priors.
**Good video generation quality**: The authors give a good demonstration of the capability of the vide
**Overstated Novelty and Lack of Meaningful Comparison to Concurrent Work:** The paper frames the use of video models for policy learning as a novel contribution. This is a significant overstatement. The 2024-2025 period has seen a proliferation of work on this exact topic, including but not limited to concurrent models like **UVA**[1], **UWM**[2], and **VPP**[3], which explore unified video-action architectures. The paper fails to properly situate itself within this crowded landscape, and it
- The paper is well-written. The introduction and method are easy to follow.
- The experiments are comprehensive and support the main claims in the paper.
- The video prediction results look realistic.
- The results show that Video Policy is superior to the baselines.
- Using action-free data provides a potentially scalable way to do data-driven robot learning.
- The computation cost is high. The inference speed of the video model could slow down the policy rollout. Can the authors provide a comparison between Video Policy and other policy baselines? Can the authors propose some ways to accelerate the policy FPS?
- It remains unclear whether Video Policy still performs well in tasks with higher dynamics. The video model may produce physically implausible results. It would be interesting if the authors could explore the behaviour of the policy and the video prediction mode
1. Strong empirical results, especially on RoboCasa and Libero10, showing improved success rates over baselines with a compact architecture.
2. Clarity and reproducibility: the paper is clearly written, with transparent architecture figures, ablations, and hyperparameter details.
1. While the video diffusion model can be pretrained without actions, the action model still requires action-labeled data for fine-tuning. Therefore, the framework does not eliminate the need for action supervision. The title and framing (“action-free video learning”) are somewhat overstated.
2. The action decoder is trained on limited demonstrations and does not inherit the generalization ability of the video diffusion model. As a result, its performance still depends heavily on the diversity a
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis
