An Empirical Study of Autoregressive Pre-training from Videos
Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik

TL;DR
This paper investigates autoregressive pre-training from videos using transformer models, demonstrating competitive performance across various tasks and revealing scaling behaviors similar to language models.
Contribution
It introduces Toto, a series of autoregressive video models trained on large datasets, and systematically explores architectural, training, and inference choices.
Findings
Autoregressive pre-training yields competitive performance on image recognition, video classification, object tracking, and robotics benchmarks.
Scaling the video models produces scaling curves similar to those seen in language models.
Minimal inductive biases do not hinder competitive results.
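To make the scaling finding concrete, the sketch below fits a power law of the form loss = a * compute^(-b) by linear regression in log-log space. The power-law form is an assumption borrowed from common practice in language model scaling studies, not a formula taken from this summary; `fit_power_law` is a hypothetical helper name, and the numbers in the usage lines are synthetic placeholders, not results from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) and return (a, b).

    Assumes a pure power law (no irreducible-loss offset), so the
    relationship is a straight line in log-log space.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    # log(loss) = log(a) - b * log(compute): a degree-1 polynomial fit.
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic illustration only: points generated from a known power law.
synthetic_compute = np.array([1e1, 1e2, 1e3, 1e4])
synthetic_loss = 5.0 * synthetic_compute ** -0.1
a, b = fit_power_law(synthetic_compute, synthetic_loss)
```

Comparing the fitted exponent `b` across model families is one simple way to check whether two sets of scaling curves share a shape while differing in rate.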
Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a…
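The training objective described in the abstract, treating a video as a sequence of visual tokens and predicting each future token from the ones before it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `flatten_video` and `next_token_loss`, the frame-by-frame flattening order, and the array shapes are assumptions made for illustration, and the transformer that would produce the logits is omitted.

```python
import numpy as np

def flatten_video(frame_tokens):
    """Flatten (num_frames, tokens_per_frame) token ids into one 1-D sequence.

    Assumes tokens are ordered frame by frame, then in row-major order
    within each frame (a hypothetical choice for this sketch).
    """
    return np.asarray(frame_tokens).reshape(-1)

def next_token_loss(logits, tokens):
    """Mean cross-entropy for predicting tokens[t + 1] from position t.

    logits: (T, V) array of unnormalized scores, one row per position.
    tokens: (T,) integer array of visual token ids.
    """
    # Position t predicts token t + 1, so drop the last logit row
    # and the first token.
    pred = logits[:-1]
    target = tokens[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target)), target].mean())

# Usage with placeholder values: 2 frames of 3 tokens, vocabulary of 8.
tokens = flatten_video([[0, 1, 2], [3, 4, 5]])
uniform_logits = np.zeros((len(tokens), 8))
loss = next_token_loss(uniform_logits, tokens)
```

With uniform logits every token is equally likely, so the loss equals the natural log of the vocabulary size, which is a handy sanity check for any implementation of this objective.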
Taxonomy
Topics: Model Reduction and Neural Networks
