DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Teli Ma; Jia Zheng; Zifan Wang; Chunli Jiang; Andy Cui; Junwei Liang; Shuo Yang

arXiv:2603.10448 · cs.RO · March 24, 2026

Open Access

TL;DR

DiT4DiT introduces a unified video-action diffusion model that leverages spatiotemporal video features for improved robot control, achieving state-of-the-art results with higher efficiency and better generalization in simulation and real-world tasks.

Contribution

The paper presents a novel end-to-end diffusion-based framework that jointly models video dynamics and actions, enhancing robot learning and generalization.

Findings

01. Achieves 98.6% success on the LIBERO benchmark

02. Outperforms prior methods while being over 10x more data-efficient

03. Demonstrates superior real-world robot performance and zero-shot generalization

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a…
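The cascaded idea in the abstract (tap intermediate features from a video denoiser, then condition an action denoiser on them) can be illustrated with a toy sketch. This is not the authors' architecture: the layer shapes, the `tap_layer` index, the single-head attention, and the update rule in `action_denoise_step` are all hypothetical choices made only to show the data flow, and the weights are random rather than trained.

```python
import math
import random

def linear(x, w):
    """Multiply a vector x (length n) by a weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def rand_matrix(rng, n, m, scale=0.1):
    """Random n-by-m matrix; a stand-in for trained weights (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(m)] for _ in range(n)]

def video_denoiser_features(noisy_tokens, layers, tap_layer):
    """Run toy 'video' blocks and return the hidden states at tap_layer.

    Only intermediate features are returned; no future frame is reconstructed.
    """
    hidden = noisy_tokens
    for index, w in enumerate(layers):
        hidden = [[math.tanh(v) for v in linear(tok, w)] for tok in hidden]
        if index == tap_layer:
            return hidden
    raise ValueError("tap_layer must index one of the layers")

def cross_attend(query, keys):
    """Single-head dot-product attention; the key tokens double as values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * key[d] for w, key in zip(weights, keys))
            for d in range(len(query))]

def action_denoise_step(noisy_action, video_features, w_in, w_out, step_size=0.1):
    """One toy denoising update of an action vector, conditioned on video features."""
    query = linear(noisy_action, w_in)             # action dim -> feature dim
    context = cross_attend(query, video_features)  # attend over tapped video features
    update = linear(context, w_out)                # feature dim -> action dim
    return [a - step_size * u for a, u in zip(noisy_action, update)]

# Hypothetical sizes chosen for illustration only.
rng = random.Random(0)
FEATURE_DIM, ACTION_DIM, NUM_TOKENS = 8, 4, 6
video_layers = [rand_matrix(rng, FEATURE_DIM, FEATURE_DIM) for _ in range(3)]
noisy_video_tokens = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
                      for _ in range(NUM_TOKENS)]
features = video_denoiser_features(noisy_video_tokens, video_layers, tap_layer=1)

w_in = rand_matrix(rng, ACTION_DIM, FEATURE_DIM)
w_out = rand_matrix(rng, FEATURE_DIM, ACTION_DIM)
action = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
for _ in range(5):
    action = action_denoise_step(action, features, w_in, w_out)
```

The point of the sketch is the wiring: the action update never sees decoded pixels, only the hidden states tapped partway through the video pass, which is the distinction the abstract draws against methods that rely on reconstructed future frames.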

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning