R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair; Aravind Rajeswaran; Vikash Kumar; Chelsea Finn; Abhinav Gupta

arXiv:2203.12601 · cs.RO · November 21, 2022 · 82 cites

TL;DR

This paper introduces R3M, a visual representation pre-trained on human videos that makes robotic manipulation learning more data-efficient, both in simulation and on real-world tasks.

Contribution

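The paper presents R3M, a pre-trained visual representation for robot manipulation. It is trained on diverse human video data (Ego4D) with a combination of time-contrastive learning, video-language alignment, and an L1 sparsity penalty, and it outperforms existing representations such as CLIP and MoCo on downstream manipulation tasks.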

Findings

01

Across a suite of 12 simulated manipulation tasks, R3M improves task success by over 20% compared to training from scratch.

02

R3M improves task success by over 10% compared to state-of-the-art visual representations like CLIP and MoCo.

03

R3M enables a Franka Emika Panda arm to learn manipulation tasks in a real, cluttered apartment from just 20 demonstrations.

Abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models…
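The three training signals named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the choice of negative squared L2 distance as the similarity score, the loss weights, and the placeholder `lang_loss` argument (standing in for the video-language alignment term) are all assumptions made for illustration.

```python
import math

def similarity(a, b):
    """Negative squared L2 distance between two embeddings (assumed choice)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def time_contrastive_loss(anchor, near, far, other):
    """InfoNCE-style loss on one tuple of frame embeddings.

    `near` is a frame close in time to `anchor`; `far` is a more distant
    frame from the same video; `other` comes from a different video.
    The loss is low when `near` is the most similar of the three.
    """
    pos = math.exp(similarity(anchor, near))
    neg = math.exp(similarity(anchor, far)) + math.exp(similarity(anchor, other))
    return -math.log(pos / (pos + neg))

def l1_penalty(z):
    """L1 norm of an embedding, which encourages sparse features."""
    return sum(abs(x) for x in z)

def combined_loss(anchor, near, far, other, lang_loss=0.0,
                  w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    """Weighted sum of the three terms; the weights are hypothetical.

    `lang_loss` is a placeholder for a video-language alignment loss
    computed elsewhere; it is not implemented in this sketch.
    """
    tcn = time_contrastive_loss(anchor, near, far, other)
    sparsity = sum(l1_penalty(z) for z in (anchor, near, far, other))
    return w_tcn * tcn + w_lang * lang_loss + w_l1 * sparsity

# Toy 2-D embeddings: the temporally close frame is also close in feature space.
anchor, near, far, other = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]
print(combined_loss(anchor, near, far, other))
```

In practice the embeddings would come from an image encoder and the losses would be averaged over a batch; the sketch only shows how the terms fit together. Once trained, the encoder would be kept frozen and only the downstream policy would be learned on top of its features.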

Code & Models

Repositories

facebookresearch/r3m
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: InfoNCE · Batch Normalization · Momentum Contrast · Contrastive Language-Image Pre-training