Scaling Vision Transformers to 22 Billion Parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme

TL;DR
This paper introduces a highly efficient method for training a 22-billion-parameter Vision Transformer, demonstrating improved performance, fairness, robustness, and alignment with human perception, marking a significant step towards scaling vision models.
Contribution
We present a recipe for efficient and stable training of a 22B-parameter Vision Transformer (ViT-22B) and show that scaling brings benefits in vision tasks similar to those seen in large language models.
Findings
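ViT-22B's downstream performance improves with scale, often using only a lightweight linear model on frozen features
Scale improves the tradeoff between fairness and performance, and improves robustness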
Achieves state-of-the-art shape/texture bias alignment
Abstract
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in…
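The abstract mentions evaluation "with a lightweight linear model on frozen features", a setup usually called linear probing. The sketch below illustrates the idea on toy data only: `frozen_features` is a hypothetical stand-in for a frozen backbone, and the synthetic inputs, learning rate, and epoch count are assumptions for illustration, not values from the paper.

```python
import math
import random


def frozen_features(x):
    """Hypothetical stand-in for a frozen backbone: a fixed map that is never trained."""
    return [x[0], x[1], x[0] * x[1]]


def train_linear_probe(features, labels, lr=1.0, epochs=300):
    """Fit logistic-regression weights on precomputed features with batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability of class 1
            err = p - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        # Only the probe parameters (w, b) are updated; the features stay fixed.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, f):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy data (assumption): the label depends on the sign of x0 * x1, which is
# not linearly separable in the raw inputs but is in the frozen features.
rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x0 * x1 > 0 else 0 for x0, x1 in inputs]

# Features are computed once up front, since the backbone is frozen.
feats = [frozen_features(x) for x in inputs]
w, b = train_linear_probe(feats, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(feats, labels)) / len(labels)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

Because the backbone is frozen, features can be computed once and cached, so the probe is cheap to train compared with fine-tuning the full model.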
Code & Models
- 🤗 HuggingFaceM4/idefics-80b · model · 331 dl · ♡ 69
- 🤗 HuggingFaceM4/idefics-9b · model · 1.9k dl · ♡ 47
- 🤗 HuggingFaceM4/idefics-9b-instruct · model · 1.2k dl · ♡ 107
- 🤗 HuggingFaceM4/idefics-80b-instruct · model · 5.3k dl · ♡ 189
- 🤗 areegtarek/idefics-9b-instruct-all · model · 12 dl
- 🤗 stabilityai/stablelm-2-12b · model · 2.9k dl · ♡ 120
- 🤗 RichardErkhov/stabilityai_-_stablelm-2-12b-4bits · model · 2 dl
- 🤗 RichardErkhov/stabilityai_-_stablelm-2-12b-gguf · model · 169 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
