Humanoid Locomotion as Next Token Prediction

Ilija Radosavovic; Bike Zhang; Baifeng Shi; Jathushan Rajasegaran; Sarthak Kamat; Trevor Darrell; Koushil Sreenath; Jitendra Malik

arXiv:2402.19469 · cs.RO · March 1, 2024 · 5 cites

TL;DR

This paper frames humanoid control as a next token prediction task solved with a causal transformer, enabling a full-sized humanoid to walk zero-shot in the real world and to generalize to unseen commands.

Contribution

It formulates humanoid control as next token prediction, which allows transfer from simulation to the real world with limited data and makes use of trajectories with missing modalities, such as videos without actions.

Findings

01

Enables a full-sized humanoid to walk in San Francisco zero-shot

02

Transfers to the real world when trained on only 27 hours of walking data

03

Generalizes to unseen commands like walking backward

Abstract

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to…
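The modality-aligned objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scalar tokens, a strictly interleaved sequence of observations and actions, a squared-error loss, and `None` standing in for the token of a missing modality; all function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (modality, value); value None marks a missing token (assumption of this sketch)
Token = Tuple[str, Optional[float]]


def interleave(observations: List[float],
               actions: Optional[List[float]] = None) -> List[Token]:
    """Build [o_1, a_1, o_2, a_2, ...]; absent actions become masked tokens."""
    tokens: List[Token] = []
    for t, obs in enumerate(observations):
        tokens.append(("obs", obs))
        tokens.append(("act", None if actions is None else actions[t]))
    return tokens


def aligned_targets(tokens: List[Token]) -> List[Optional[int]]:
    """For each position, index of the next same-modality token, or None.

    In a strictly interleaved sequence the same modality recurs every
    two positions. Targets that are missing (masked) are skipped.
    """
    targets: List[Optional[int]] = []
    for i in range(len(tokens)):
        j = i + 2
        if j < len(tokens) and tokens[j][1] is not None:
            targets.append(j)
        else:
            targets.append(None)
    return targets


def masked_mse(predictions: List[float], tokens: List[Token]) -> float:
    """Mean squared error over positions that have a valid aligned target."""
    total, count = 0.0, 0
    for pred, j in zip(predictions, aligned_targets(tokens)):
        if j is None:
            continue  # no loss where the target is masked or out of range
        total += (pred - tokens[j][1]) ** 2
        count += 1
    return total / count if count else 0.0


# Trajectory with actions, and a video-like trajectory without them.
full = interleave([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
video = interleave([0.0, 1.0, 2.0])
print(aligned_targets(full))
print(aligned_targets(video))
```

In this sketch the action-free trajectory still contributes a training signal through its observation tokens, which is the property the abstract attributes to the general formulation.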

Videos

Humanoid Locomotion as Next Token Prediction · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition