VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Youpeng Wen; Junfan Lin; Yi Zhu; Jianhua Han; Hang Xu; Shen Zhao; Xiaodan Liang

arXiv:2411.09153 · cs.CV · November 15, 2024

TL;DR

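VidMan pre-trains a video diffusion model on robot trajectory data and then adapts it to predict actions, a two-stage training scheme inspired by dual-process theory from neuroscience. The implicit dynamics learned in the first stage improve robot manipulation: VidMan outperforms the GR-1 baseline on CALVIN and improves precision on the small-scale OXE dataset.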

Contribution

Introduces VidMan, a two-stage framework that first pre-trains a video diffusion model on robot video data and then adapts it to inverse dynamics for action prediction, so that manipulation benefits from the implicit dynamics learned during pre-training.

Findings

01 Outperforms the state-of-the-art baseline GR-1 by 11.7% (relative improvement) on the CALVIN benchmark.

02 Achieves a precision improvement of over 9% on the small-scale OXE dataset.

03 Demonstrates the effectiveness of world models in robot action prediction.

Abstract

Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. This suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) for predicting future visual…

Videos

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation · SlidesLive

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Adapter · Diffusion