Real-World Robot Learning with Masked Visual Pre-training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

TL;DR
This paper demonstrates that self-supervised masked autoencoder pre-training on diverse in-the-wild videos significantly improves visual representations for real-world robotic tasks, outperforming prior methods and benefiting from large-scale training.
Contribution
It introduces a scalable masked autoencoder pre-training approach for robotic vision, showing its effectiveness across various tasks and embodiments, and highlights the advantages of large-scale pre-training.
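As a rough illustration of the masking step in masked autoencoder pre-training, the sketch below splits an image's patch indices into a visible set (fed to the encoder) and a masked set (to be reconstructed). The function name `random_patch_mask`, the 196-patch grid, and the 75% mask ratio are assumptions chosen for illustration; the summary above does not state the paper's actual settings.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper: in MAE-style pre-training the encoder sees only
    the visible patches and a decoder reconstructs the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # prints "49 147"
```

Because most patches are hidden, the encoder processes only a small fraction of each image, which is one reason this style of pre-training is commonly described as scalable.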
Findings
Pre-trained representations outperform CLIP (by up to 75%), supervised ImageNet pre-training (by up to 81%), and training from scratch (by up to 81%).
Scaling up to 307M parameters and 4.5M images enhances performance.
Pre-training benefits are consistent across different real-world robotic tasks and embodiments.
Abstract
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
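The "frozen encoder, learnable control module" arrangement described in the abstract can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: `frozen_encoder` is a hypothetical stand-in for the pre-trained vision transformer, `LinearController` is a hypothetical one-layer control module, and the training signal (squared-error regression onto synthetic demonstration actions) is assumed for illustration only.

```python
import random


def frozen_encoder(pixels):
    """Stand-in for a pre-trained, frozen visual encoder (hypothetical).

    It has no trainable state: it maps an image (a flat list of floats)
    to a fixed feature vector [mean, max, bias].
    """
    return [sum(pixels) / len(pixels), max(pixels), 1.0]


class LinearController:
    """Learnable control module: a single linear layer on frozen features."""

    def __init__(self, feature_dim):
        self.weights = [0.0] * feature_dim

    def act(self, features):
        return sum(w * f for w, f in zip(self.weights, features))

    def update(self, features, target_action, lr):
        # One SGD step on squared error; only the controller weights change.
        error = self.act(features) - target_action
        self.weights = [w - lr * error * f
                        for w, f in zip(self.weights, features)]
        return error * error


def mean_loss(controller, dataset):
    return sum((controller.act(f) - a) ** 2 for f, a in dataset) / len(dataset)


rng = random.Random(0)
images = [[rng.random() for _ in range(16)] for _ in range(64)]

# Synthetic demonstrations (assumed): the target action depends on brightness.
dataset = []
for img in images:
    feats = frozen_encoder(img)  # the encoder is applied, never updated
    dataset.append((feats, 2.0 * feats[0] + 1.0))

controller = LinearController(feature_dim=3)
loss_before = mean_loss(controller, dataset)
for _ in range(200):
    for feats, action in dataset:
        controller.update(feats, action, lr=0.1)
loss_after = mean_loss(controller, dataset)
print(loss_after < loss_before)
```

The design point is that gradients never reach the encoder: features can be computed once and reused, and only the small control module is fit to the robot data.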
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · Contrastive Language-Image Pre-training
