VILP: Imitation Learning with Latent Video Planning

Zhengtong Xu; Qiang Qiu; Yu She

arXiv:2502.01784·cs.RO·February 5, 2025

TL;DR

VILP introduces a latent video diffusion model for efficient, time-consistent video generation to enhance robot policy learning, reducing data needs and supporting multi-modal actions.

Contribution

The paper presents a novel latent video diffusion approach for predictive robot videos, improving efficiency, temporal consistency, and multi-modal action representation in imitation learning.

Findings

01 VILP has lower training costs and faster inference than the existing video-generation robot policy it is compared against.

02 Generated videos exhibit high temporal consistency across multiple views.

03 VILP maintains robust policy performance with less high-quality task-specific data.

Abstract

In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of…
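The generation-speed example in the abstract can be turned into a rough throughput figure. The sketch below is only back-of-the-envelope arithmetic: it assumes that "5 Hz" means five complete generation passes per second, each producing both views, which the abstract does not state explicitly. The function name `generation_throughput` and its dictionary keys are hypothetical and are not taken from the VILP repository.

```python
# Back-of-the-envelope throughput for the example in the abstract.
# Assumption: "5 Hz" means five complete generation passes per second,
# where one pass yields both views. The abstract does not spell this out.

VIEWS = 2              # two distinct perspectives
FRAMES_PER_VIDEO = 6   # six frames per generated video
RES_A, RES_B = 96, 160 # "96x160" pixels; only the product is used here
RATE_HZ = 5            # generation passes per second (assumed meaning)


def generation_throughput(views, frames, res_a, res_b, rate_hz):
    """Return frame and pixel throughput implied by the given settings."""
    frames_per_pass = views * frames
    frames_per_second = frames_per_pass * rate_hz
    pixels_per_second = frames_per_second * res_a * res_b
    return {
        "frames_per_pass": frames_per_pass,
        "frames_per_second": frames_per_second,
        "pixels_per_second": pixels_per_second,
        "seconds_per_pass": 1.0 / rate_hz,
    }


stats = generation_throughput(VIEWS, FRAMES_PER_VIDEO, RES_A, RES_B, RATE_HZ)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under that reading, each pass produces 12 frames within a budget of 0.2 seconds, or 60 generated frames per second across both views.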

Code & Models

Repositories

zhengtongxu/vilp (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation

Methods: Diffusion