Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani; Josip Djolonga; Basil Mustafa; Piotr Padlewski; Jonathan Heek; Justin Gilmer; Andreas Steiner; Mathilde Caron; Robert Geirhos; Ibrahim Alabdulmohsin; Rodolphe Jenatton; Lucas Beyer; Michael Tschannen; Anurag Arnab; Xiao Wang; Carlos Riquelme; Matthias Minderer; Joan Puigcerver; Utku Evci; Manoj Kumar; Sjoerd van Steenkiste; Gamaleldin F. Elsayed; Aravindh Mahendran; Fisher Yu; Avital Oliver; Fantine Huot; Jasmijn Bastings; Mark Patrick Collier; Alexey Gritsenko; Vighnesh Birodkar; Cristina Vasconcelos; Yi Tay; Thomas Mensink; Alexander Kolesnikov; Filip Pavetić; Dustin Tran; Thomas Kipf; Mario Lučić; Xiaohua Zhai; Daniel Keysers; Jeremiah Harmsen; Neil Houlsby

arXiv:2302.05442 · cs.CV · February 13, 2023 · 118 cites

Open Access 1 Repo 8 Models 1 Video

TL;DR

This paper presents a recipe for efficient and stable training of a 22-billion-parameter Vision Transformer (ViT-22B). Downstream performance improves with scale, as do the fairness/performance tradeoff, robustness, and alignment with human visual perception.

Contribution

We present a recipe for highly efficient and stable training of a 22B-parameter Vision Transformer (ViT-22B) and show, through a wide variety of experiments, that scale brings benefits to vision models similar to those seen in large language models.

Findings

01

ViT-22B shows improved performance with scale

02

Scale improves the tradeoff between fairness and performance, and improves robustness

03

Achieves state-of-the-art alignment to human visual perception in terms of shape/texture bias

Abstract

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in…
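The abstract mentions evaluating downstream tasks "with a lightweight linear model on frozen features". The following is a minimal sketch of that general idea (a linear probe), not the paper's evaluation code. Everything in it is an assumption for illustration: `frozen_features` is a hypothetical stand-in for a frozen backbone (a fixed random projection rather than ViT-22B), the data is synthetic, and closed-form ridge regression onto one-hot targets is just one common way to fit such a probe.

```python
import numpy as np

def frozen_features(images, proj):
    """Hypothetical stand-in for a frozen backbone: fixed projection + tanh.
    The projection is never updated, mirroring the 'frozen features' setup."""
    return np.tanh(images @ proj)

def fit_linear_probe(feats, labels, num_classes, l2=1e-3):
    """Fit a linear classifier by ridge regression onto one-hot targets."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append bias column
    y = np.eye(num_classes)[labels]                       # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def predict(feats, weights):
    """Predict the class with the highest linear score."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return (x @ weights).argmax(axis=1)

# Synthetic data (assumption): each class is a Gaussian blob in input space.
rng = np.random.default_rng(0)
num_classes, dim_in, dim_feat = 3, 16, 32
proj = rng.normal(size=(dim_in, dim_feat)) / np.sqrt(dim_in)
centers = rng.normal(scale=3.0, size=(num_classes, dim_in))

def make_split(n):
    labels = rng.integers(0, num_classes, size=n)
    images = centers[labels] + rng.normal(size=(n, dim_in))
    return images, labels

train_x, train_y = make_split(300)
test_x, test_y = make_split(100)

# Only the probe weights are learned; the feature extractor stays fixed.
weights = fit_linear_probe(frozen_features(train_x, proj), train_y, num_classes)
accuracy = (predict(frozen_features(test_x, proj), weights) == test_y).mean()
print(f"linear-probe accuracy: {accuracy:.2f}")
```

Because only the probe's weight matrix is trained, the cost of adapting to a new task is small compared with fine-tuning the backbone, which is why this style of evaluation is attractive for very large models.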

Code & Models

Repositories

lucidrains/flash-cosine-sim-attention
pytorch

Videos

Scaling Vision Transformers to 22 Billion Parameters · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications