An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran; Ilija Radosavovic; Rahul Ravishankar; Yossi Gandelsman; Christoph Feichtenhofer; Jitendra Malik

arXiv:2501.05453·cs.CV·January 10, 2025

TL;DR

This paper studies autoregressive pre-training of transformer models on videos. The resulting representations are competitive on image recognition, video classification, object tracking, and robotics tasks, and the models show scaling behavior similar to that of language models.

Contribution

It introduces Toto, a series of autoregressive video models pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens, and systematically explores architectural, training, and inference design choices.

Findings

01  Autoregressive pre-training yields competitive performance on all evaluated benchmarks.

02  Scaling the video models produces scaling curves similar to those seen in language models.

03  The models reach these results despite minimal inductive biases.
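
The scaling finding can be illustrated by fitting a curve to loss-versus-compute points. The sketch below assumes a power-law form, L(C) = a * C^(-b), which is the form commonly used for language-model scaling curves; this page does not state the functional form or the fitted values. The function name `fit_power_law` and the data points are hypothetical and synthetic, not results from the paper.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space.

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the ordinary least-squares line y = slope * x + intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points generated from L = 10 * C**-0.05 (illustrative only,
# NOT measurements from the paper).
compute = [1e15, 1e16, 1e17, 1e18]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A straight line in log-log space corresponds to a power law, so comparing the fitted exponent `b` across model families is one way to compare scaling rates.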

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a…
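
The training objective described in the abstract (treat a video as a sequence of visual tokens and predict each token from the ones before it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are assumed to be already tokenized into discrete ids, the flattening order is assumed to be frame by frame, and `uniform_model` is a hypothetical stand-in for a transformer.

```python
import math

def flatten_video(frames):
    """Concatenate per-frame token ids into one sequence, in temporal order."""
    return [tok for frame in frames for tok in frame]

def next_token_pairs(tokens):
    """Build (context, target) pairs: the target tokens[i] is predicted from tokens[:i]."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def autoregressive_loss(tokens, model, vocab_size):
    """Average cross-entropy of the model's next-token predictions."""
    pairs = next_token_pairs(tokens)
    total = 0.0
    for context, target in pairs:
        probs = softmax(model(context, vocab_size))
        total += -math.log(probs[target])
    return total / len(pairs)

def uniform_model(context, vocab_size):
    """Hypothetical placeholder model: assigns equal logits to every token."""
    return [0.0] * vocab_size

# Two toy frames of three tokens each (made-up ids, vocabulary of size 8).
frames = [[3, 1, 4], [1, 5, 2]]
tokens = flatten_video(frames)
loss = autoregressive_loss(tokens, uniform_model, vocab_size=8)
print(tokens, round(loss, 4))
```

With the uniform placeholder the loss equals log(vocab_size), the value an untrained model would be expected to start near; a trained model lowers it by assigning higher probability to the tokens that actually follow.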

Taxonomy

Topics: Model Reduction and Neural Networks