TL;DR
DeFI introduces a disentangled pretraining framework for robot learning that separates forward and inverse dynamics so each can draw on its own data sources, improving results on CALVIN ABC-D, SimplerEnv-Fractal, and real-world deployment.
Contribution
The paper proposes a novel decoupled pretraining approach with separate models for forward and inverse dynamics, enhancing robot learning from large-scale, action-free videos.
Findings
Achieved state-of-the-art results on CALVIN ABC-D with an average task length of 4.51.
Attained a 51.2% success rate on the SimplerEnv-Fractal benchmark.
Reached an 81.3% success rate in real-world deployment.
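For readers unfamiliar with the "average task length" metric reported for CALVIN: the benchmark's long-horizon evaluation asks the policy to complete a chain of consecutive tasks (five per chain, to our understanding), and the metric is the mean number of tasks completed in a row before the first failure. The sketch below shows that arithmetic. The function names and the rollout counts are hypothetical and purely illustrative; they are not results from the paper.

```python
from typing import Sequence


def average_task_length(completed_per_chain: Sequence[int], chain_length: int = 5) -> float:
    """Mean number of consecutive tasks completed per evaluation chain.

    completed_per_chain[i] is how many tasks chain i finished before its
    first failure. chain_length=5 is an assumption about the CALVIN protocol.
    """
    if not completed_per_chain:
        raise ValueError("need at least one evaluation chain")
    for n in completed_per_chain:
        if not 0 <= n <= chain_length:
            raise ValueError(f"completed count {n} outside [0, {chain_length}]")
    return sum(completed_per_chain) / len(completed_per_chain)


def success_rate_at_step(completed_per_chain: Sequence[int], k: int) -> float:
    """Fraction of chains that completed at least k tasks in a row."""
    return sum(1 for n in completed_per_chain if n >= k) / len(completed_per_chain)


# Hypothetical rollouts (NOT data from the paper): tasks completed in each chain.
rollouts = [5, 5, 4, 5, 3, 5, 5, 2]

avg_len = average_task_length(rollouts)

# The same number is the sum of the per-step success rates for k = 1..5.
avg_len_from_rates = sum(success_rate_at_step(rollouts, k) for k in range(1, 6))
```

Under this definition the maximum possible score is the chain length, so a value of 4.51 means that most evaluation chains were completed in full.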
Abstract
Vision-language-action (VLA) models have shown great potential in building generalist robots, but still face a dilemma: the misalignment of 2D image forecasting and 3D action prediction. In addition, this entangled vision-action training limits learning from large-scale, action-free web video data. To address these issues, we propose DeFI, a novel framework that Decouples visual Forward and Inverse dynamics pretraining so that each can exploit its own data sources, with video generation and action prediction disentangled. We introduce the General Forward Dynamics Model (GFDM), pretrained on diverse human and robot videos for future prediction, and the General Inverse Dynamics Model (GIDM), trained via self-supervised learning to infer latent actions from unlabeled video transitions. These models are then integrated into a unified architecture for end-to-end finetuning on downstream…
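To make the forward/inverse split concrete, here is a toy sketch of the two interfaces: a forward model maps an observation to a predicted future observation, and an inverse model maps a pair of observations to a latent action. Everything in it is an assumption made for illustration: the class names, the random linear maps standing in for learned networks, the feature and latent dimensions, and in particular the way the two models are chained in `predict_latent_action`. The abstract does not say how GFDM and GIDM are integrated, so this is not DeFI's architecture.

```python
import numpy as np


class ToyForwardDynamics:
    """Hypothetical stand-in for a forward dynamics model.

    Maps an observation feature vector to a predicted future feature vector.
    No action labels are involved, so such a model could be trained on video alone.
    """

    def __init__(self, obs_dim: int, rng: np.random.Generator) -> None:
        # Near-identity linear map as a placeholder for a learned predictor.
        self.weight = np.eye(obs_dim) + 0.01 * rng.standard_normal((obs_dim, obs_dim))

    def predict(self, obs: np.ndarray) -> np.ndarray:
        return obs @ self.weight


class ToyInverseDynamics:
    """Hypothetical stand-in for an inverse dynamics model.

    Maps a transition (obs, next_obs) to a latent action vector.
    """

    def __init__(self, obs_dim: int, latent_dim: int, rng: np.random.Generator) -> None:
        # Random projection as a placeholder for a learned encoder.
        self.proj = rng.standard_normal((2 * obs_dim, latent_dim)) / np.sqrt(2 * obs_dim)

    def infer(self, obs: np.ndarray, next_obs: np.ndarray) -> np.ndarray:
        return np.concatenate([obs, next_obs], axis=-1) @ self.proj


def predict_latent_action(fdm: ToyForwardDynamics, idm: ToyInverseDynamics,
                          obs: np.ndarray) -> np.ndarray:
    """Assumed composition: imagine the future first, then infer the action
    that would explain the transition from obs to the imagined future."""
    imagined_next = fdm.predict(obs)
    return idm.infer(obs, imagined_next)


rng = np.random.default_rng(0)
fdm = ToyForwardDynamics(obs_dim=16, rng=rng)
idm = ToyInverseDynamics(obs_dim=16, latent_dim=4, rng=rng)

obs_batch = rng.standard_normal((3, 16))  # three made-up observation features
latent_actions = predict_latent_action(fdm, idm, obs_batch)
```

The point of the sketch is only the separation of concerns: the forward model never sees actions, and the inverse model never has to generate images, which is what lets each be pretrained on a different kind of data.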
