Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Peiyan Li; Yixiang Chen; Yuan Xu; Jiabing Yang; Xiangnan Wu; Jun Guo; Nan Sun; Long Qian; Xinghang Li; Xin Xiao; Jing Liu; Nianfeng Liu; Tao Kong; Yan Huang; Liang Wang; Tieniu Tan

arXiv:2604.03181·cs.RO·April 6, 2026

TL;DR

This paper introduces MV-VDP, a multi-view video diffusion policy that models 3D spatio-temporal environment dynamics for robotic manipulation, enabling data-efficient, robust, and generalizable actions.

Contribution

The novel MV-VDP approach jointly predicts multi-view heatmap and RGB videos, aligning video pretraining with action finetuning for improved manipulation performance.
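
As a rough illustration of why multi-view heatmap videos are a convenient action representation, the sketch below reads a 3D trajectory out of per-view heatmap frames. It is not the authors' decoding procedure, which this summary does not describe. The function names, the use of exactly three orthographic views (top, front, side), the axis layout of each view, and the argmax readout are all assumptions made for the example.

```python
import numpy as np


def heatmap_peaks(heatmaps: np.ndarray) -> np.ndarray:
    """Return the (row, col) peak of every frame in every view.

    heatmaps has shape (V, T, H, W): V views, T frames per view.
    The result is an integer array of shape (V, T, 2).
    """
    V, T, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(V, T, H * W).argmax(axis=-1)
    rows, cols = np.unravel_index(flat_idx, (H, W))
    return np.stack([rows, cols], axis=-1)


def fuse_orthographic(peaks: np.ndarray) -> np.ndarray:
    """Fuse peaks from three orthographic views into 3D grid points.

    Hypothetical layout (an assumption, not taken from the paper):
      view 0 = top   (rows = x, cols = y)
      view 1 = front (rows = x, cols = z)
      view 2 = side  (rows = y, cols = z)
    Each axis is observed by two views, so the two estimates are averaged.
    Returns a float array of shape (T, 3).
    """
    top, front, side = peaks.astype(float)
    x = (top[:, 0] + front[:, 0]) / 2
    y = (top[:, 1] + side[:, 0]) / 2
    z = (front[:, 1] + side[:, 1]) / 2
    return np.stack([x, y, z], axis=-1)


# Synthetic check: render a known 3D trajectory as one-hot heatmaps.
T, S = 4, 16
traj = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]])
heatmaps = np.zeros((3, T, S, S))
for t, (x, y, z) in enumerate(traj):
    heatmaps[0, t, x, y] = 1.0  # top view
    heatmaps[1, t, x, z] = 1.0  # front view
    heatmaps[2, t, y, z] = 1.0  # side view

recovered = fuse_orthographic(heatmap_peaks(heatmaps))
```

In this toy setting the recovered points match the trajectory used to render the heatmaps. A real system would also need camera calibration and a mapping from grid indices to metric coordinates, both of which are left out here.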

Findings

1. MV-VDP performs complex tasks with only ten demonstrations.
2. It outperforms existing models on Meta-World and real robots.
3. The model generalizes well and predicts realistic future videos.

Abstract

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable,…
