EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao

TL;DR
EVA-CLIP introduces new training techniques that significantly enhance the efficiency and performance of CLIP models, achieving high accuracy with fewer resources and enabling broader accessibility for research.
Contribution
The paper presents EVA-CLIP, a set of improved training methods that boost CLIP's effectiveness and efficiency, with state-of-the-art results at scale.
Findings
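The largest model, EVA-02-CLIP-E/14+ (5.0B parameters, 9 billion seen samples), achieves 82.0% zero-shot top-1 accuracy on ImageNet-1K val.
The smaller EVA-02-CLIP-L/14+ (430 million parameters, 6 billion seen samples) attains 80.4% zero-shot top-1 accuracy on ImageNet-1K val.
Outperforms previous CLIP models with the same number of parameters at significantly smaller training cost.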
Abstract
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of…
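The "zero-shot top-1 accuracy" reported in the abstract is typically measured by comparing each image embedding against one text embedding per class and counting how often the best-matching class is the correct one. The following is a minimal sketch of that metric, not the authors' evaluation code: the function names and the three-dimensional toy embeddings are hypothetical, and a real evaluation would take its embeddings from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(image_emb, class_embs):
    """Cosine similarity between one image embedding and every class embedding."""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(c))) for c in class_embs]


def zero_shot_predict(image_emb, class_embs):
    """Return the index of the class whose text embedding is most similar."""
    scores = cosine_scores(image_emb, class_embs)
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_top1(image_embs, labels, class_embs):
    """Top-1 accuracy in percent over a labelled set of image embeddings."""
    correct = sum(
        zero_shot_predict(emb, class_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)


# Hypothetical toy data: three classes, four images, 3-dimensional embeddings.
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.9], [0.6, 0.5, 0.0]]
labels = [0, 1, 2, 1]

accuracy = zero_shot_top1(image_embs, labels, class_embs)
print(f"zero-shot top-1: {accuracy:.1f}%")  # prints "zero-shot top-1: 75.0%"
```

In this toy set the last image is closest to class 0 while its label is 1, so three of four predictions are correct. No labelled training images for the target classes are used; the class embeddings alone define the classifier, which is what makes the setting zero-shot.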
Code & Models
- 🤗 QuanSun/EVA-CLIP · model · ♡ 114
- 🤗 timm/eva_giant_patch14_224.clip_ft_in1k · model · 216 dl · ♡ 2
- 🤗 timm/eva_giant_patch14_336.clip_ft_in1k · model · 113 dl · ♡ 1
- 🤗 timm/eva02_base_patch14_224.mim_in22k · model · 9.1k dl · ♡ 6
- 🤗 timm/eva02_base_patch14_448.mim_in22k_ft_in1k · model · 2.7k dl · ♡ 4
- 🤗 timm/eva02_base_patch14_448.mim_in22k_ft_in22k · model · 278 dl · ♡ 1
- 🤗 timm/eva02_base_patch14_448.mim_in22k_ft_in22k_in1k · model · 5.3k dl · ♡ 8
- 🤗 timm/eva02_large_patch14_224.mim_in22k · model · 806 dl · ♡ 2
- 🤗 timm/eva02_large_patch14_224.mim_m38m · model · 259 dl
- 🤗 timm/eva02_large_patch14_448.mim_in22k_ft_in1k · model · 1.0k dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: Contrastive Language-Image Pre-training
