Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu; Yanjiang Guo; Pengchao Wang; Xiaoyu Chen; Yen-Jen Wang; Jianke Zhang; Koushil Sreenath; Chaochao Lu; Jianyu Chen

arXiv:2412.14803·cs.CV·May 6, 2025

TL;DR

This paper introduces Video Prediction Policy (VPP), a robot control method leveraging the predictive capabilities of video diffusion models to improve generalization and success in complex manipulation tasks.

Contribution

The paper proposes VPP, a novel approach that uses pre-trained video diffusion models to incorporate future dynamics into robot policies, enhancing generalization and performance.

Findings

01. VPP achieves an 18.6% relative improvement on the Calvin ABC-D generalization benchmark.

02. VPP achieves a 31.6% increase in success rates on real-world dexterous manipulation tasks.

03. Fine-tuning a pre-trained video foundation model yields more precise future predictions for robot control.

Abstract

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and have shown a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict a more precise future, we fine-tune pre-trained video…
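The core idea in the abstract, an inverse dynamics model conditioned on a predicted future, can be illustrated with a deliberately tiny sketch. Everything below is a hypothetical toy and not the authors' implementation: `predict_future` stands in for the video diffusion model, the "state" is a short list of floats rather than a latent visual representation, and the inverse dynamics is explicit (a state difference) whereas VPP learns it implicitly. The toy world is assumed to follow `next_state = state + action`.

```python
from typing import List

Vector = List[float]


def predict_future(state: Vector, goal: Vector, step: float = 0.5) -> Vector:
    """Hypothetical stand-in for a video predictor: imagine the next state
    as a fraction `step` of the way from the current state to the goal."""
    return [s + step * (g - s) for s, g in zip(state, goal)]


def inverse_dynamics(state: Vector, future: Vector) -> Vector:
    """Toy inverse dynamics for a world where next_state = state + action:
    the action that explains the transition is the element-wise difference."""
    return [f - s for s, f in zip(state, future)]


def policy(state: Vector, goal: Vector) -> Vector:
    """Choose an action by first predicting a future, then asking which
    action would move the current state to that predicted future."""
    future = predict_future(state, goal)
    return inverse_dynamics(state, future)


def rollout(state: Vector, goal: Vector, steps: int) -> Vector:
    """Apply the policy repeatedly under the assumed toy dynamics."""
    for _ in range(steps):
        action = policy(state, goal)
        state = [s + a for s, a in zip(state, action)]
    return state


if __name__ == "__main__":
    print(policy([0.0, 0.0], [1.0, 2.0]))
    print(rollout([0.0, 0.0], [1.0, 2.0], steps=3))
```

The point of the split is that the action head never sees the goal directly; it only sees the current state and the predicted future, so the quality of the prediction bounds the quality of the action. This is one reading of why the abstract emphasizes predicting a more precise future.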

Code & Models

Repositories

roboterax/video-prediction-policy (PyTorch)

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Diffusion · Contrastive Learning