EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

TL;DR
EVA is a large-scale vision foundation model trained solely on publicly available data. It scales to one billion parameters, achieves state-of-the-art results across a range of vision tasks, and serves as a bridge to multi-modal models: initializing CLIP with EVA improves training stability.
Contribution
The paper introduces EVA, a scalable vision foundation model trained with a simple masked image modeling task, setting new benchmarks and enabling efficient multi-modal training.
Findings
EVA achieves top performance on image recognition, detection, and segmentation tasks.
Scaling EVA to one billion parameters improves transfer learning capabilities.
Initializing CLIP with EVA enhances training stability and reduces data requirements.
Abstract
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same…
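The pretext task described in the abstract (regressing image-text aligned features at masked positions, given the visible patches) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the uniform random masking, the 40% mask ratio, and the choice of negative cosine similarity as the regression loss are assumptions made for illustration, and plain Python lists stand in for the outputs of a ViT and a frozen feature teacher.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def random_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a masked patch (hypothetical helper)."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity, computed over masked patches only.

    predicted: per-patch features from the student (assumed to be a ViT output)
    target:    per-patch image-text aligned features from a frozen teacher
    mask:      True where the patch was hidden from the student
    """
    losses = [
        -cosine_similarity(p, t)
        for p, t, m in zip(predicted, target, mask)
        if m
    ]
    if not losses:
        raise ValueError("mask selects no patches")
    return sum(losses) / len(losses)


# Toy data: 16 patches with 8-dimensional features (illustrative sizes).
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
mask = random_mask(num_patches=16, mask_ratio=0.4, rng=rng)

# A student that reproduces the teacher features exactly reaches the minimum.
perfect_loss = masked_feature_loss(target, target, mask)

# A student that predicts the opposite direction reaches the maximum.
flipped = [[-x for x in vec] for vec in target]
worst_loss = masked_feature_loss(flipped, target, mask)
```

Only masked positions contribute to the loss, so the student is scored on features it could not read off directly from its input. In this toy setup `perfect_loss` is close to -1.0 and `worst_loss` is close to 1.0, the two extremes of the negative cosine objective.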
Code & Models
- 🤗 BAAI/EVA (model) · ♡ 30
- 🤗 timm/eva_large_patch14_196.in22k_ft_in1k (model) · 97 dl
- 🤗 timm/eva_large_patch14_196.in22k_ft_in22k_in1k (model) · 8.9k dl · ♡ 3
- 🤗 timm/eva_large_patch14_336.in22k_ft_in1k (model) · 75 dl
- 🤗 timm/eva_large_patch14_336.in22k_ft_in22k_in1k (model) · 1.2k dl · ♡ 1
- 🤗 timm/eva_giant_patch14_336.m30m_ft_in22k_in1k (model) · 240 dl
- 🤗 timm/eva_giant_patch14_560.m30m_ft_in22k_in1k (model) · 345 dl · ♡ 3
- 🤗 kadirnar/timm_model_list (model) · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Contrastive Language-Image Pre-training
