Combined Scaling for Zero-shot Transfer Learning
Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

TL;DR
The paper introduces BASIC, a combined scaling method for contrastive image-text models that scales data size, model size, and batch size together. It reports state-of-the-art zero-shot ImageNet accuracy and improved robustness, addresses the memory cost of training at this scale, and analyzes the theoretical benefits of large batch sizes.
Contribution
BASIC is the first to systematically scale contrastive image-text models across three dimensions (data size, model size, and batch size), demonstrating improved zero-shot accuracy and robustness without training on labeled ImageNet examples.
Findings
Achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example.
Large contrastive batch sizes reduce the generalization gap.
Overcomes memory limitations with gradient checkpointing and model parallelism.
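To make the role of batch size in the findings above concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss of the kind used by CLIP-like models. This summary does not give BASIC's exact loss, so the function names, the temperature value of 0.07, and the pure-Python implementation are all illustrative assumptions, not the authors' code. The point it shows is that each example is contrasted against every other example in the batch, so the batch size sets the number of negatives.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_row(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Hypothetical helper: image_embs[i] and text_embs[i] are assumed to be
    a matching pair; all other pairings in the batch act as negatives.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With uninformative embeddings (every image and text mapped to the same vector), this loss equals log(n) for a batch of n pairs, so a larger batch poses a harder discrimination task. The memory cost of the n-by-n similarity matrix and the encoder activations behind it is what techniques such as gradient checkpointing and model parallelism are meant to relieve.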
Abstract
We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN, and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
Methods: Gradient Checkpointing · Contrastive Learning · ALIGN · Contrastive Language-Image Pre-training
