Scalable Pre-training of Large Autoregressive Image Models

Alaaeldin El-Nouby; Michal Klein; Shuangfei Zhai; Miguel Angel Bautista; Alexander Toshev; Vaishaal Shankar; Joshua M Susskind; Armand Joulin

arXiv:2401.08541·cs.CV·January 17, 2024·6 cites

TL;DR

This paper presents AIM, a family of autoregressive vision models inspired by large language models. It shows that downstream performance keeps improving with model size and training data, with no sign of saturation, and that the value of the pre-training objective correlates with downstream task performance.

Contribution

Introduces AIM, a large-scale autoregressive image model trained without image-specific stabilization, which exhibits scaling properties similar to those of language models and shows room for further growth.

Findings

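01 Performance of the visual features scales with both model capacity and the quantity of pre-training data.

02 The value of the objective function correlates with downstream task performance.

03 A 7 billion parameter AIM pre-trained on 2 billion images achieves 84.0% on ImageNet-1k with a frozen trunk.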

Abstract

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images, which achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the…
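As a rough illustration of what an autoregressive objective over images can look like, the sketch below splits an image into a sequence of patches and trains a model to predict each patch from the ones before it. The abstract does not give these details, so everything here is an assumption made for illustration: the raster patch order, the mean-squared-error regression target, the single causal attention layer, and all function and parameter names (`patchify`, `causal_self_attention`, `next_patch_loss`, `init_params`). It is not the authors' implementation.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into raster-ordered, flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_row, grid_col, patch_row, patch_col, C)
    return x.reshape(gh * gw, patch * patch * c)

def causal_self_attention(x, wq, wk, wv):
    """Single-head attention where position i only attends to positions <= i."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # hide future patches
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def init_params(rng, patch_dim, dim):
    """Hypothetical toy parameters: embedding, attention, and prediction head."""
    s = 0.02
    return {
        "embed": rng.normal(scale=s, size=(patch_dim, dim)),
        "wq": rng.normal(scale=s, size=(dim, dim)),
        "wk": rng.normal(scale=s, size=(dim, dim)),
        "wv": rng.normal(scale=s, size=(dim, dim)),
        "head": rng.normal(scale=s, size=(dim, patch_dim)),
    }

def next_patch_loss(patches, params):
    """Mean squared error between the prediction at position i and patch i + 1."""
    tokens = patches @ params["embed"]
    hidden = causal_self_attention(tokens, params["wq"], params["wk"], params["wv"])
    pred = hidden @ params["head"]
    # The last position has no successor, so it contributes no target.
    return np.mean((pred[:-1] - patches[1:]) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 8, 3))  # stand-in for a real image
    patches = patchify(image, patch=4)  # 4 patches of 48 values each
    params = init_params(rng, patch_dim=patches.shape[1], dim=16)
    print("next-patch loss:", next_patch_loss(patches, params))
```

The causal mask is what makes the objective autoregressive: the prediction for a patch can depend only on earlier patches in the sequence, in the same way a language model predicts the next token from the preceding ones.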

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI

Methods: Linear Layer · Softmax · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer