TL;DR
This paper introduces a large-scale video pretraining approach for robot control that enables zero-shot planning and execution across diverse real-world tasks. The models and datasets are open-sourced.
Contribution
The paper uses large-scale video pretraining as the primary modality for robot foundation models, as an alternative to extending MLLMs into vision-language-action (VLA) systems, and reports strong generalization and real-world applicability.
Findings
Zero-shot video plans enable successful robot task execution.
The model generalizes across diverse scenes and tasks.
The open dataset and model support reproducibility and further research.
Abstract
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale,…
