Real-World Robot Learning with Masked Visual Pre-training

Ilija Radosavovic; Tete Xiao; Stephen James; Pieter Abbeel; Jitendra Malik; Trevor Darrell

arXiv:2210.03109·cs.RO·October 7, 2022·27 cites

TL;DR

This paper demonstrates that self-supervised masked autoencoder pre-training on diverse in-the-wild videos significantly improves visual representations for real-world robotic tasks, outperforming prior methods and benefiting from large-scale training.

Contribution

It introduces a scalable masked autoencoder pre-training approach for robotic vision, showing its effectiveness across various tasks and embodiments, and highlights the advantages of large-scale pre-training.

Findings

01

Pre-trained representations outperform CLIP and supervised ImageNet pre-training.

02

Scaling up to 307M parameters and 4.5M images enhances performance.

03

Pre-training benefits are consistent across different robotic tasks.

Abstract

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
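The masked autoencoder objective mentioned in the abstract hides a random subset of image patches and trains the model to reconstruct them from the visible ones. The following is a minimal sketch of that masking step only, not the authors' implementation. The function name `random_patch_mask`, the 75% mask ratio, and the 224x224 image with 16x16 patches are illustrative assumptions that do not come from this page.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Hypothetical helper for illustration; mask_ratio=0.75 is an assumed default.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random permutation of patch positions
    num_keep = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(indices[:num_keep])  # patches the encoder would see
    masked = sorted(indices[num_keep:])   # patches to be reconstructed
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In the setup the abstract describes, only pre-training would use this masking. For control, the pre-trained encoder is frozen and its output is passed to a learnable control module.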

Code & Models

Repositories

ir413/mvp
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · Contrastive Language-Image Pre-training