DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, Shuo Yang

TL;DR
DiT4DiT introduces a unified video-action diffusion model that leverages spatiotemporal video features for improved robot control, achieving state-of-the-art results with higher efficiency and better generalization in simulation and real-world tasks.
Contribution
The paper presents a novel end-to-end diffusion-based framework that jointly models video dynamics and actions, enhancing robot learning and generalization.
Findings
Achieves 98.6% success on the LIBERO benchmark
Outperforms prior methods while being over 10x more data-efficient
Demonstrates superior real-world robot performance and zero-shot generalization
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a…
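The cascaded conditioning described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the `tap_layer` parameter, the layer sizes, the random untrained weights, and the crude fixed-step update are all assumptions made for illustration. It only shows the data flow, in which a hidden activation from inside the video denoiser (rather than its reconstructed output) conditions the action denoiser.

```python
import random

def linear(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def make_weights(rng, n_in, n_out):
    """Random, untrained weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]

class ToyVideoDenoiser:
    """Hypothetical stand-in for a video diffusion backbone (linear + ReLU layers)."""
    def __init__(self, rng, dim, depth):
        self.layers = [make_weights(rng, dim, dim) for _ in range(depth)]

    def forward(self, noisy_latent, tap_layer):
        """Return (final output, activation captured after layer `tap_layer`)."""
        h, tapped = noisy_latent, None
        for i, w in enumerate(self.layers):
            h = [max(0.0, v) for v in linear(h, w)]
            if i == tap_layer:
                tapped = list(h)  # intermediate denoising feature
        return h, tapped

class ToyActionDenoiser:
    """Hypothetical stand-in for an action denoiser conditioned on video features."""
    def __init__(self, rng, action_dim, cond_dim):
        self.w = make_weights(rng, action_dim + cond_dim, action_dim)

    def predict_noise(self, noisy_action, cond):
        # Conditioning by concatenation; an assumption, chosen for simplicity.
        return linear(noisy_action + cond, self.w)

def sample_action(video_model, action_model, video_latent, action_dim, steps, rng, tap_layer=1):
    """Run one video denoiser pass, then refine a noisy action using its hidden feature."""
    # The video model's final output is discarded; only the tapped feature is used.
    _, cond = video_model.forward(video_latent, tap_layer)
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for _ in range(steps):
        eps = action_model.predict_noise(action, cond)
        # Crude fixed-step update, not a real diffusion sampler.
        action = [a - e / steps for a, e in zip(action, eps)]
    return action, cond

rng = random.Random(0)
video_model = ToyVideoDenoiser(rng, dim=8, depth=3)
action_model = ToyActionDenoiser(rng, action_dim=4, cond_dim=8)
latent = [rng.gauss(0.0, 1.0) for _ in range(8)]
action, cond = sample_action(video_model, action_model, latent, action_dim=4, steps=5, rng=rng)
```

Because the weights are untrained, the resulting `action` values carry no meaning; the point is that no future frame is ever decoded before the action is predicted.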
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
