VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation
Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, Xiaodan Liang

TL;DR
VidMan leverages video diffusion models and a two-stage training process inspired by neuroscience to improve robot manipulation by understanding implicit dynamics, outperforming existing models on key benchmarks.
Contribution
Introduces VidMan, a novel two-stage framework that pre-trains on video data and adapts to inverse dynamics, enhancing robot manipulation with implicit dynamics understanding.
Findings
Outperforms the state-of-the-art baseline GR-1 on the CALVIN benchmark, with an 11.7% relative improvement.
Achieves precision gains of over 9% on the small-scale OXE dataset.
Demonstrates the effectiveness of world models in robot action prediction.
Abstract
Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. It suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) for predicting future visual…
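The two-stage idea described above (first learn to predict future observations, then adapt the frozen predictor to output actions) can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: scalar linear models stand in for the video diffusion backbone and the action adapter, the data are synthetic, and every name (`fit_scalar`, `w_dynamics`, `k_adapter`, `predict_action`) is hypothetical.

```python
# Toy illustration only: scalar least-squares fits stand in for the video
# diffusion backbone (stage 1) and the action adapter (stage 2).
# All names and data here are assumptions, not taken from the paper.

def fit_scalar(xs, ys):
    """Closed-form least squares for y ~ c * x (no intercept)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

# Synthetic trajectories: next_obs = obs + action, collected with the
# hypothetical policy action = -0.5 * obs.
obs = [1.0, -2.0, 3.0, 0.5, -1.5]
actions = [-0.5 * o for o in obs]
next_obs = [o + a for o, a in zip(obs, actions)]

# Stage 1: learn to predict the next observation from the current one.
# No action labels are used at this stage.
w_dynamics = fit_scalar(obs, next_obs)

# Stage 2: keep w_dynamics frozen and fit only a small adapter that maps
# the predicted change in observation to an action.
predicted_delta = [w_dynamics * o - o for o in obs]
k_adapter = fit_scalar(predicted_delta, actions)

def predict_action(o):
    """Action predicted from the frozen stage-1 model plus the adapter."""
    return k_adapter * (w_dynamics * o - o)
```

In this toy setting the second stage only has to learn one parameter because the first stage already captures how observations evolve, which is the intuition behind reusing a pre-trained dynamics model when action-labelled data are scarce.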
Taxonomy
Topics: Advanced Vision and Imaging
Methods: Adapter · Diffusion
