R+X: Retrieval and Execution from Everyday Human Videos

Georgios Papagiannis; Norman Di Palo; Pietro Vitiello; Edward Johns

arXiv:2407.12957 · cs.RO · April 4, 2025 · 1 citation

TL;DR

R+X is a framework that lets robots learn and execute household skills directly from unlabelled human videos. It combines retrieval with a vision-language model and in-context imitation learning, so it needs neither manual annotation nor a training phase.

Contribution

The paper introduces a retrieval-and-execution framework that learns robot skills from unlabelled human videos using a vision-language model and in-context learning, without manual annotation or training.

Findings

01 R+X successfully translates unlabelled human videos into robot skills.

02 R+X outperforms recent alternative methods in household tasks.

03 The framework enables immediate skill execution without training.

Abstract

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at…
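
The two phases described in the abstract (retrieve relevant clips, then execute by conditioning an in-context policy on them) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Clip`, `retrieve`, `execute`, `r_plus_x`, `keyword_relevance`, and `echo_policy` are hypothetical names, the keyword matcher is only a toy stand-in for a VLM query, and the echo policy is only a toy stand-in for an in-context imitation learner such as KAT.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Clip:
    """A short segment of the long human video (hypothetical container)."""
    start_frame: int
    end_frame: int
    description: str  # toy stand-in for the visual content a VLM would inspect


# Assumed interfaces: any callables with these shapes can be plugged in.
RelevanceFn = Callable[[str, Clip], bool]
PolicyFn = Callable[[Sequence[Clip], str], List[str]]


def retrieve(command: str, clips: Sequence[Clip], is_relevant: RelevanceFn) -> List[Clip]:
    """Retrieval phase: keep the clips judged relevant to the command."""
    return [clip for clip in clips if is_relevant(command, clip)]


def execute(demos: Sequence[Clip], observation: str, policy: PolicyFn) -> List[str]:
    """Execution phase: condition the policy on the demos; no training step."""
    if not demos:
        raise ValueError("no relevant clips were retrieved for this command")
    return policy(demos, observation)


def r_plus_x(command: str, clips: Sequence[Clip], observation: str,
             is_relevant: RelevanceFn, policy: PolicyFn) -> List[str]:
    """Chain the two phases: retrieval first, then in-context execution."""
    demos = retrieve(command, clips, is_relevant)
    return execute(demos, observation, policy)


def keyword_relevance(command: str, clip: Clip) -> bool:
    """Toy stand-in for a VLM: true if any command word appears in the clip text."""
    clip_words = clip.description.lower().split()
    return any(word in clip_words for word in command.lower().split())


def echo_policy(demos: Sequence[Clip], observation: str) -> List[str]:
    """Toy stand-in for an in-context learner: one placeholder action per demo."""
    return [f"follow demo frames {d.start_frame}-{d.end_frame} given {observation}"
            for d in demos]


if __name__ == "__main__":
    video = [
        Clip(0, 120, "open the drawer"),
        Clip(121, 300, "wipe the table"),
        Clip(301, 420, "close the drawer"),
    ]
    print(r_plus_x("wipe table", video, "obs0", keyword_relevance, echo_policy))
```

Passing the relevance check and the policy in as callables keeps the sketch independent of any particular VLM or imitation-learning library; a real system would replace both stand-ins with model calls.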

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis