A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly

TL;DR
This paper introduces the Visual Task Adaptation Benchmark (VTAB) to evaluate the generalization of visual representations across diverse tasks, providing insights into the effectiveness of various learning algorithms and supervision methods.
Contribution
The paper presents VTAB, a comprehensive benchmark for assessing visual representations on diverse tasks, and conducts a large-scale study comparing different learning algorithms and supervision techniques.
Findings
ImageNet representations perform well beyond natural datasets
Generative and discriminative models show comparable effectiveness
Self-supervision can often replace labels effectively
Abstract
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained…
Code & Models
- 🤗 laion/CLIP-ViT-L-14-laion2B-s32B-b82K · model · 275k dl · ♡ 63
- 🤗 laion/CLIP-ViT-H-14-laion2B-s32B-b79K · model · 410k dl · ♡ 451
- 🤗 laion/CLIP-ViT-bigG-14-laion2B-39B-b160k · model · 70k dl · ♡ 308
- 🤗 laion/CLIP-ViT-B-32-laion2B-s34B-b79K · model · 2.3M dl · ♡ 138
- 🤗 laion/CLIP-ViT-g-14-laion2B-s12B-b42K · model · 11k dl · ♡ 44
- 🤗 hoaiht/CLIP-ViT-H-14-laion2B-s32B-b79K · model · 341 dl · ♡ 2
- 🤗 laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k · model · 775 dl · ♡ 2
- 🤗 laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k · model · 52k dl · ♡ 14
- 🤗 laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k · model · 2.3k dl · ♡ 23
- 🤗 lysandre/CLIP-ViT-L-14-laion2B-s32B-b82K · model · 39 dl
Videos
The Visual Task Adaptation Benchmark · youtube
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Average Pooling · Residual Connection · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling
