Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling

Yueru Jia; Jiaming Liu; Shengbang Liu; Rui Zhou; Wanhe Yu; Yuyang Yan; Xiaowei Chi; Yandong Guo; Boxin Shi; Shanghang Zhang

arXiv:2512.03044 · cs.RO · March 25, 2026

TL;DR

Video2Act is a dual-system framework that uses the coherent, physically consistent motion representations of video diffusion models (VDMs) to guide robotic action learning. It surpasses previous state-of-the-art methods by 7.7% in simulation and achieves a 21.7% higher success rate in real-world tasks.

Contribution

The paper presents a dual-system approach that extracts spatial and motion-aware representations from video diffusion models (VDMs) and supplies them as additional conditioning inputs to a diffusion transformer (DiT) action head for robotic policy learning.

Findings

01. Surpasses previous state-of-the-art methods by 7.7% in simulation

02. Achieves 21.7% higher success rate in real-world tasks

03. Demonstrates strong generalization capabilities across tasks

Abstract

Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate…
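The abstract names two cues drawn from VDM representations, foreground boundaries and inter-frame motion variations, with background noise filtered out before they condition the action head. A minimal sketch of that idea follows, assuming each frame is already available as a small 2D feature map. The operators used here (forward-difference gradients, frame differencing, a hard threshold) are illustrative stand-ins and not the paper's extraction method, and every name (`boundary_map`, `motion_map`, `suppress_below`, `conditioning_features`) is hypothetical.

```python
from typing import List

# One H x W feature map per frame. Hypothetical stand-in for a VDM feature slice.
Frame = List[List[float]]


def motion_map(prev: Frame, curr: Frame) -> Frame:
    """Absolute per-cell change between two consecutive frames."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def boundary_map(frame: Frame) -> Frame:
    """L1 magnitude of forward-difference gradients; differences past the edge are 0."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = frame[y][x + 1] - frame[y][x] if x + 1 < w else 0.0
            dy = frame[y + 1][x] - frame[y][x] if y + 1 < h else 0.0
            out[y][x] = abs(dx) + abs(dy)
    return out


def suppress_below(m: Frame, threshold: float) -> Frame:
    """Zero out weak responses (assumed proxy for filtering background noise)."""
    return [[v if v >= threshold else 0.0 for v in row] for row in m]


def conditioning_features(frames: List[Frame], threshold: float = 0.5) -> List[List[float]]:
    """For each consecutive frame pair, return [mean boundary, mean motion] response."""
    feats = []
    for prev, curr in zip(frames, frames[1:]):
        b = suppress_below(boundary_map(curr), threshold)
        m = suppress_below(motion_map(prev, curr), threshold)
        n = len(curr) * len(curr[0])
        feats.append([sum(map(sum, b)) / n, sum(map(sum, m)) / n])
    return feats


# Toy clip: a single bright cell moves one step to the right between two frames.
f0 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
f1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
clip_features = conditioning_features([f0, f1])
static_features = conditioning_features([f0, f0])
```

In this toy setup a static clip yields a zero motion response while the boundary response stays non-zero, which is the kind of separation between "where the object is" and "how it moves" that the abstract describes. In a real policy the pooled values would be replaced by learned tokens passed to the action head.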

Taxonomy

Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning