Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby

TL;DR
This paper introduces AIMV2, a scalable multimodal vision encoder trained autoregressively on images and text, achieving state-of-the-art results in vision and multimodal tasks with a simple pre-training process.
Contribution
The paper presents AIMV2, a novel multimodal vision encoder that extends autoregressive pre-training to handle both images and text, demonstrating superior performance across various benchmarks.
Findings
AIMV2-3B achieves 89.5% accuracy on ImageNet-1k with a frozen trunk.
AIMV2 outperforms state-of-the-art contrastive models such as CLIP in multimodal understanding.
The method scales well and performs strongly on downstream vision tasks such as localization, grounding, and classification.
Abstract
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP,…
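The pairing described in the abstract, a decoder that autoregressively generates raw image patches and text tokens, can be pictured as a next-element prediction loss over one combined sequence. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes image patches precede text tokens, uses mean squared error for patch regression and cross-entropy for text, and takes the model's predictions as plain lists instead of running a real encoder and decoder. All function names and the `image_weight` parameter are hypothetical.

```python
import math


def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened patches in raster order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def patch_mse(pred, target):
    """Mean squared error between a predicted and a target patch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def token_cross_entropy(logits, target_id):
    """Cross-entropy of one token, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target_id]


def mean(values):
    """Average of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def autoregressive_loss(patch_preds, patches, text_logits, text_ids,
                        image_weight=1.0):
    """Assumed combined objective: text loss + image_weight * image loss.

    patch_preds[i] is the prediction for patches[i + 1] given patches[:i + 1].
    text_logits[j] is the prediction for text_ids[j] given every patch and
    text_ids[:j].
    """
    image_terms = [patch_mse(patch_preds[i], patches[i + 1])
                   for i in range(len(patches) - 1)]
    text_terms = [token_cross_entropy(text_logits[j], text_ids[j])
                  for j in range(len(text_ids))]
    return mean(text_terms) + image_weight * mean(image_terms)
```

With perfect patch predictions and uniform logits over a vocabulary of size V, this loss reduces to log V, which is a convenient sanity check for the sketch.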
Code & Models
- 🤗 apple/aimv2-large-patch14-224-lit · model · 3.6k dl · ♡ 7
- 🤗 apple/aimv2-large-patch14-224 · model · 1.5k dl · ♡ 62
- 🤗 apple/aimv2-huge-patch14-224 · model · 43 dl · ♡ 13
- 🤗 apple/aimv2-1B-patch14-224 · model · 170 dl · ♡ 8
- 🤗 apple/aimv2-3B-patch14-224 · model · 29 dl · ♡ 4
- 🤗 apple/aimv2-large-patch14-336 · model · 68 dl · ♡ 5
- 🤗 apple/aimv2-huge-patch14-336 · model · 392 dl
- 🤗 apple/aimv2-1B-patch14-336 · model · 6 dl
- 🤗 apple/aimv2-3B-patch14-336 · model · 17 dl · ♡ 5
- 🤗 apple/aimv2-large-patch14-448 · model · 27 dl · ♡ 7
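The checkpoint names above appear to follow the common `patch<P>-<R>` naming pattern. Assuming that reading (patch size P in pixels and a square input resolution R, which the listing itself does not state), the number of image patches per input would be (R / P)². The helper below is a hypothetical sketch of that arithmetic, not part of any released code.

```python
import re


def patch_count(repo_id):
    """Return the assumed number of patches per image for a checkpoint name.

    Assumes the name contains 'patch<P>-<R>', where P is the patch size and
    R is the side length of a square input image.
    """
    match = re.search(r"patch(\d+)-(\d+)", repo_id)
    if match is None:
        raise ValueError(f"no patch/resolution pattern in {repo_id!r}")
    patch, resolution = int(match.group(1)), int(match.group(2))
    if resolution % patch != 0:
        raise ValueError("resolution is not a multiple of the patch size")
    side = resolution // patch
    return side * side


for name in ("apple/aimv2-large-patch14-224",
             "apple/aimv2-large-patch14-336",
             "apple/aimv2-large-patch14-448"):
    print(name, patch_count(name))
```

Under this assumption the 224, 336, and 448 variants correspond to 256, 576, and 1024 patches respectively.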
Taxonomy
Topics: Neural Networks and Applications
Methods: Contrastive Language-Image Pre-training
