Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune

TL;DR
This paper introduces Video PreTraining (VPT), a method for learning behavioral priors from unlabeled online videos using semi-supervised imitation learning, enabling agents to perform complex tasks with minimal labeled data.
Contribution
The paper presents an approach to pretraining decision-making models from unlabeled videos. A small labeled dataset is used to train an inverse dynamics model, which then labels a large corpus of online video; a general behavioral prior is trained on the result by imitation.
Findings
Behavioral priors trained on unlabeled videos show nontrivial zero-shot capabilities.
Models can be fine-tuned with imitation learning and reinforcement learning on hard-exploration tasks.
First reported agents to craft diamond tools in Minecraft, a task that can take proficient humans upwards of 20 minutes (24,000 environment actions).
Abstract
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and…
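The pipeline the abstract describes (train an inverse dynamics model on a little labeled data, use it to label a large unlabeled corpus, then imitate the labeled result) can be illustrated with a toy sketch. This is not the authors' implementation: the paper works with neural networks on Minecraft video, whereas here a lookup table stands in for both models and a 1-D walk toward a goal stands in for gameplay. All names (`GOAL`, `rollout`, `train_idm`, `pseudo_label`, `behavioral_cloning`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

GOAL = 5  # hypothetical target position on a 1-D line


def expert_action(state):
    """Toy demonstrator: step toward GOAL, then stay put."""
    if state < GOAL:
        return 1
    if state > GOAL:
        return -1
    return 0


def rollout(start, steps):
    """Play one episode; return (states, actions)."""
    states, actions = [start], []
    for _ in range(steps):
        action = expert_action(states[-1])
        actions.append(action)
        states.append(states[-1] + action)
    return states, actions


def train_idm(labeled_episodes):
    """Fit a toy inverse dynamics model from (states, actions) pairs.

    It maps the observed change between consecutive states to the
    action most often seen with that change in the labeled data.
    """
    votes = defaultdict(Counter)
    for states, actions in labeled_episodes:
        for t, action in enumerate(actions):
            votes[states[t + 1] - states[t]][action] += 1
    table = {delta: c.most_common(1)[0][0] for delta, c in votes.items()}
    # Unseen changes fall back to the "stay" action (an arbitrary choice).
    return lambda s, s_next: table.get(s_next - s, 0)


def pseudo_label(videos, idm):
    """Attach IDM-predicted actions to action-free state sequences."""
    return [
        (states, [idm(s, s_next) for s, s_next in zip(states, states[1:])])
        for states in videos
    ]


def behavioral_cloning(episodes):
    """Tabular imitation: pick the most frequent action per state."""
    votes = defaultdict(Counter)
    for states, actions in episodes:
        for state, action in zip(states, actions):
            votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


# 1. Small labeled set: two episodes with recorded actions.
labeled = [rollout(0, 8), rollout(10, 8)]
# 2. Large "video" set: states only, actions discarded.
videos = [rollout(start, 30)[0] for start in range(-10, 21)]
# 3. Label the videos with the IDM, then clone the behavior.
idm = train_idm(labeled)
policy = behavioral_cloning(pseudo_label(videos, idm))

print(policy[0], policy[5], policy[9])
```

The point of the sketch is the data flow: the policy never sees a ground-truth action from the large corpus, only the labels the inverse dynamics model inferred from consecutive observations.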
Taxonomy
Topics: Human Pose and Action Recognition · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis
