EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang; Wen Wang; Binhui Xie; Quan Sun; Ledell Wu; Xinggang Wang; Tiejun Huang; Xinlong Wang; Yue Cao

arXiv:2211.07636·cs.CV·December 6, 2022·23 cites

TL;DR

EVA is a large-scale vision foundation model trained solely on publicly available data, achieving state-of-the-art results across various vision tasks and serving as a bridge to multi-modal models, with scalable architecture and improved training stability.

Contribution

The paper introduces EVA, a scalable vision foundation model trained with a simple masked image modeling task, setting new benchmarks and enabling efficient multi-modal training.

Findings

01 EVA achieves top performance on image recognition, detection, and segmentation tasks.

02 Scaling EVA to one billion parameters improves transfer learning capabilities.

03 Initializing CLIP with EVA enhances training stability and reduces data requirements.

Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same…
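
The pretext task described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that masked-out image-text aligned features are reconstructed from visible patches. The uniform random masking, the 0.4 mask ratio, the negative cosine similarity loss, and all function names below are assumptions made for illustration, and random vectors stand in for both the target features and the model's predictions.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_mask(num_patches, mask_ratio, rng):
    """Choose which patch indices to hide (uniform sampling is an assumption)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_feature_loss(predicted, target, masked_indices):
    """Mean negative cosine similarity, computed on masked patches only.

    The loss choice is an assumption; the abstract does not name one.
    """
    sims = [cosine_similarity(predicted[i], target[i]) for i in masked_indices]
    return -sum(sims) / len(sims)


rng = random.Random(0)
num_patches, dim = 16, 8

# Hypothetical target: per-patch image-text aligned features from a frozen teacher.
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
masked = sample_mask(num_patches, mask_ratio=0.4, rng=rng)

# A perfect reconstruction reaches the minimum loss of about -1.
perfect_loss = masked_feature_loss(target, target, masked)

# A noisy prediction stands in for the output of a partially trained model.
noisy = [[x + rng.gauss(0, 1) for x in row] for row in target]
noisy_loss = masked_feature_loss(noisy, target, masked)
```

In an actual training setup, `predicted` would come from a ViT that sees only the visible patches, and the loss would be backpropagated through that model while the target features stay fixed.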

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training