Scalable Pre-training of Large Autoregressive Image Models
Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin

TL;DR
This paper presents AIM, a scalable autoregressive vision model inspired by large language models, demonstrating that larger models trained on more data improve performance without saturation, and highlighting the correlation between objective value and downstream task success.
Contribution
Introduces AIM, a family of large-scale autoregressive image models trained without image-specific stabilization, showing scaling behavior similar to that of language models and no sign of saturation at the largest scale tested.
Findings
Model performance scales with capacity and data size.
Objective function value correlates with downstream task performance.
A 7-billion-parameter model pre-trained on 2 billion images achieves 84.0% on ImageNet-1k with a frozen trunk.
Abstract
This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, and (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7-billion-parameter AIM on 2 billion images, which achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the…
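Illustration
The abstract does not spell out the objective, so the following is only a minimal sketch of what an autoregressive objective over image patches can look like. It assumes a grayscale image, non-overlapping patches read in raster order, and a mean-squared-error regression target; all of these are assumptions made for illustration, not details confirmed by the text above. The names patchify, autoregressive_loss, and copy_last are hypothetical, and copy_last is a trivial stand-in for a learned model.

```python
from typing import Callable, List, Sequence

Patch = List[float]


def patchify(image: Sequence[Sequence[float]], patch: int) -> List[Patch]:
    """Split a 2D image into flattened patch x patch tiles in raster order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def mse(a: Patch, b: Patch) -> float:
    """Mean squared error between two flattened patches of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def autoregressive_loss(
    patches: List[Patch],
    predict: Callable[[List[Patch]], Patch],
) -> float:
    """Average error of predicting patch k from patches[:k], for k = 1..K-1.

    The predictor only ever sees the preceding patches, which is what makes
    the objective autoregressive. The regression loss is an assumption.
    """
    if len(patches) < 2:
        raise ValueError("need at least two patches")
    errors = [mse(predict(patches[:k]), patches[k]) for k in range(1, len(patches))]
    return sum(errors) / len(errors)


def copy_last(prefix: List[Patch]) -> Patch:
    """Hypothetical baseline predictor: repeat the most recent patch."""
    return list(prefix[-1])


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
patches = patchify(image, 2)
loss = autoregressive_loss(patches, copy_last)
```

In this toy case each patch differs from its predecessor by exactly 1 in every pixel, so the baseline's loss is 1.0. A trained model would replace copy_last, and the Findings above suggest that a lower value of such an objective tends to go with better downstream performance.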
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Linear Layer · Softmax · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer
