Humanoid Locomotion as Next Token Prediction
Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik

TL;DR
This paper frames real-world humanoid control as next token prediction, using a causal transformer trained autoregressively on sensorimotor trajectories. The resulting model walks zero-shot on a full-sized humanoid and generalizes to unseen commands.
Contribution
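The paper formulates humanoid control as next token prediction over sensorimotor trajectories. The formulation can use data with missing modalities, such as video trajectories without actions, and the trained model transfers to the real world even with limited training data.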
Findings
Enables a full-sized humanoid to walk in San Francisco zero-shot
Transfers to the real world even when trained on only 27 hours of walking data
Generalizes to commands not seen during training, such as walking backward
Abstract
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to…
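The modality-aligned objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name, the array shapes, the mean squared error loss, and the `has_action` mask are all assumptions made for illustration. The idea shown is that each observation token is scored against the next observation, each action token against the next action, and action terms are skipped for trajectories (such as video) that carry no actions.

```python
import numpy as np

def modality_aligned_loss(pred_obs, pred_act, obs, act, has_action):
    """Hypothetical modality-aligned next-token loss for one trajectory.

    pred_obs[t] is the model's guess for obs[t + 1] (made at the observation token),
    pred_act[t] is the model's guess for act[t + 1] (made at the action token).
    has_action[t] is False where the action at step t is missing.
    Shapes: obs and pred_obs are (T, obs_dim); act and pred_act are (T, act_dim).
    """
    # Observation tokens: compare each prediction with the next observation.
    obs_err = np.mean((pred_obs[:-1] - obs[1:]) ** 2)

    # Action tokens: per-step error against the next action.
    act_sq = np.mean((pred_act[:-1] - act[1:]) ** 2, axis=-1)

    # Ignore steps whose target action is missing (assumed masking scheme).
    valid = has_action[1:].astype(float)
    act_err = (act_sq * valid).sum() / max(valid.sum(), 1.0)

    return obs_err + act_err

# Toy trajectory: 4 steps, 3-dim observations, 2-dim actions.
obs = np.arange(12.0).reshape(4, 3)
act = np.ones((4, 2))
pred_obs = np.roll(obs, -1, axis=0)   # perfect next-observation predictions
pred_act = np.zeros((4, 2))           # deliberately wrong action predictions

loss_with_actions = modality_aligned_loss(
    pred_obs, pred_act, obs, act, np.ones(4, dtype=bool))
loss_video_only = modality_aligned_loss(
    pred_obs, pred_act, obs, act, np.zeros(4, dtype=bool))
```

In this toy case the video-only trajectory contributes only the observation term, so wrong action predictions do not affect its loss, while the fully labeled trajectory is penalized for them.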
Taxonomy
Topics: Human Pose and Action Recognition
