Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

Bowen Baker; Ilge Akkaya; Peter Zhokhov; Joost Huizinga; Jie Tang; Adrien Ecoffet; Brandon Houghton; Raul Sampedro; Jeff Clune

arXiv:2206.11795 · cs.LG · June 24, 2022 · 50 citations

TL;DR

This paper introduces Video PreTraining (VPT), a method for learning behavioral priors from unlabeled online videos using semi-supervised imitation learning, enabling agents to perform complex tasks with minimal labeled data.

Contribution

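The paper presents an approach to pretraining decision-making models from unlabeled videos: an inverse dynamics model trained on a small labeled dataset is used to label a much larger set of online videos, and a behavioral prior is then trained on those labels by imitation. The resulting models reach human-level performance on many tasks.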

Findings

01. Behavioral priors trained on unlabeled videos show nontrivial zero-shot capabilities.

02. The prior can be fine-tuned with imitation learning and reinforcement learning on hard-exploration tasks.

03. First reported agents to craft diamond tools in Minecraft, a task that can take proficient humans upwards of 20 minutes of gameplay.

Abstract

Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and…
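The three-stage recipe described in the abstract (train an inverse dynamics model on a small labeled set, use it to label a large unlabeled set, then train a behavioral prior on those pseudo-labels) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 2-D environment, the `demonstrator` policy, the nearest-centroid inverse dynamics model, and the softmax-regression policy are hypothetical stand-ins for the video frames, human players, and neural networks used in the actual work.

```python
import numpy as np

# Toy stand-in for the environment (assumption): 4 discrete actions move a 2-D point.
MOVES = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def demonstrator(obs):
    """Hypothetical 'human' policy: step toward the origin along the larger axis."""
    scores = np.stack([-obs[:, 0], obs[:, 0], -obs[:, 1], obs[:, 1]], axis=1)
    return scores.argmax(axis=1)

def rollout(rng, n):
    """Sample n transitions (obs, action, next_obs) from the demonstrator."""
    obs = rng.uniform(-1.0, 1.0, size=(n, 2))
    act = demonstrator(obs)
    nxt = obs + 0.1 * MOVES[act] + rng.normal(0.0, 0.005, size=(n, 2))
    return obs, act, nxt

def fit_idm(obs, act, nxt):
    """Fit a toy inverse dynamics model: one centroid of (next_obs - obs) per action."""
    delta = nxt - obs
    return np.stack([delta[act == a].mean(axis=0) for a in range(len(MOVES))])

def idm_predict(centroids, obs, nxt):
    """Infer the action from the observed transition (sees both past and future)."""
    delta = nxt - obs
    dists = np.linalg.norm(delta[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_policy(obs, act, steps=300, lr=1.0):
    """Behavioral cloning: softmax regression from obs to (pseudo-)labeled actions."""
    W = np.zeros((obs.shape[1], len(MOVES)))
    onehot = np.eye(len(MOVES))[act]
    for _ in range(steps):
        logits = obs @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= lr * obs.T @ (probs - onehot) / len(obs)  # cross-entropy gradient step
    return W

def run_pipeline(seed=0, n_labeled=200, n_unlabeled=5000, n_test=1000):
    rng = np.random.default_rng(seed)
    # 1. Small labeled set: train the inverse dynamics model.
    obs_l, act_l, nxt_l = rollout(rng, n_labeled)
    idm = fit_idm(obs_l, act_l, nxt_l)
    # 2. Large "unlabeled" set: true actions are kept only to score the IDM.
    obs_u, act_u_hidden, nxt_u = rollout(rng, n_unlabeled)
    pseudo = idm_predict(idm, obs_u, nxt_u)
    idm_acc = float((pseudo == act_u_hidden).mean())
    # 3. Behavioral cloning on the pseudo-labels only.
    W = fit_policy(obs_u, pseudo)
    obs_t, act_t, _ = rollout(rng, n_test)
    policy_acc = float(((obs_t @ W).argmax(axis=1) == act_t).mean())
    return idm_acc, policy_acc

if __name__ == "__main__":
    idm_acc, policy_acc = run_pipeline()
    print(f"IDM pseudo-label accuracy: {idm_acc:.3f}")
    print(f"Cloned policy agreement with demonstrator: {policy_acc:.3f}")
```

The point of the sketch is the asymmetry the abstract relies on: the inverse dynamics model sees both the current and the next observation, so inferring the action is a much easier problem than predicting it from the current observation alone, and a small labeled set suffices to label a far larger one.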

Code & Models

Models

mancub/NitroGen (Hugging Face model)

Videos

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (Paper Explained) (YouTube)

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (SlidesLive)

Taxonomy

Topics: Human Pose and Action Recognition · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis