DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto

TL;DR
DynaMo introduces an in-domain self-supervised approach for learning visual representations from expert demonstrations, significantly enhancing imitation learning efficiency without relying on out-of-domain data or complex augmentations.
Contribution
DynaMo jointly learns latent inverse and forward dynamics models from in-domain data, improving visual representation quality for visuomotor control tasks.
Findings
DynaMo outperforms prior self-supervised methods in imitation learning tasks.
Representation quality improves across various policy architectures.
Ablation studies highlight key components contributing to performance gains.
Abstract
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or…
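As a rough illustration of the objective described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes plain linear maps in place of the image encoder and both dynamics models, pairwise prediction over consecutive frames rather than a sequence model, and a mean-squared-error loss. All names (`linear`, `random_matrix`, `dynamics_loss`, `enc_w`, `inv_w`, `fwd_w`) and all dimensions are hypothetical. Only the loss computation is shown; no training loop is included.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector (a list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weight matrix; stands in for a learned model (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dynamics_loss(frames, enc_w, inv_w, fwd_w):
    """Mean next-embedding prediction error over one demonstration.

    frames: list of flattened observations, one per time step.
    enc_w:  encoder weights, shape (embed_dim, obs_dim).
    inv_w:  inverse dynamics weights, shape (action_dim, 2 * embed_dim).
    fwd_w:  forward dynamics weights, shape (embed_dim, embed_dim + action_dim).
    """
    # Encode every frame into a latent embedding.
    z = [linear(enc_w, f) for f in frames]
    total = 0.0
    for t in range(len(z) - 1):
        # Inverse dynamics: infer a latent action from consecutive embeddings.
        # No ground-truth action is used anywhere.
        latent_action = linear(inv_w, z[t] + z[t + 1])
        # Forward dynamics: predict the next embedding from the current
        # embedding and the inferred latent action.
        pred = linear(fwd_w, z[t] + latent_action)
        # MSE in latent space (the loss choice is an assumption of this sketch).
        total += sum((p - q) ** 2 for p, q in zip(pred, z[t + 1])) / len(pred)
    return total / (len(z) - 1)


# Hypothetical sizes: 4 frames, 6-dim observations, 3-dim embeddings, 2-dim latent actions.
rng = random.Random(0)
obs_dim, embed_dim, action_dim = 6, 3, 2
frames = [[rng.random() for _ in range(obs_dim)] for _ in range(4)]
enc_w = random_matrix(embed_dim, obs_dim, rng)
inv_w = random_matrix(action_dim, 2 * embed_dim, rng)
fwd_w = random_matrix(embed_dim, embed_dim + action_dim, rng)
loss = dynamics_loss(frames, enc_w, inv_w, fwd_w)
```

In this sketch the latent action is lower-dimensional than the embedding, so the forward model cannot simply copy the next embedding through it. That bottleneck is an assumption made here for illustration. A real setup would also minimize this loss with a gradient-based optimizer over all three sets of weights.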
Taxonomy
Topics: Virtual Reality Applications and Impacts · Human Motion and Animation · Advanced Vision and Imaging
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer
