ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Jieneng Chen; Qihang Yu; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen

arXiv:2404.02132·cs.CV·April 5, 2024·1 cites

TL;DR

This paper introduces ViTamin, a new vision model designed for vision-language models, demonstrating superior zero-shot performance and scalability compared to traditional ViTs within the CLIP framework.

Contribution

Proposes ViTamin, a scalable vision model optimized for vision-language tasks, with extensive benchmarking showing significant improvements over standard ViTs.

Findings

01. ViTamin-L improves ImageNet zero-shot accuracy by 2.0% over ViT-L.
02. ViTamin-XL achieves 82.9% ImageNet zero-shot accuracy with 436M parameters.
03. ViTamin outperforms larger models like EVA-E on multiple benchmarks.
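The findings above are stated as zero-shot accuracy within the CLIP framework. As a rough illustration of what that metric measures, the sketch below classifies each image by picking the class whose text embedding is most similar (by cosine similarity) to the image embedding, then counts the fraction of correct picks. This is not the paper's evaluation code: the function names and the tiny 3-dimensional embeddings are hypothetical stand-ins for the outputs of a real image encoder (such as ViTamin) and a text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding has the
    highest cosine similarity with the image embedding."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)


def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Made-up embeddings for illustration only. In practice the class
# embeddings come from encoding prompts such as "a photo of a {class}"
# and the image embeddings come from the image encoder.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [
    [0.9, 0.1, 0.0],  # closest to class 0
    [0.2, 0.7, 0.1],  # closest to class 1
    [0.1, 0.2, 0.9],  # closest to class 2
    [0.6, 0.5, 0.0],  # closest to class 0, but labeled 1 (a miss)
]
labels = [0, 1, 2, 1]

accuracy = zero_shot_accuracy(image_embs, labels, class_text_embs)
print(f"toy zero-shot accuracy: {accuracy:.2f}")
```

No classifier is trained on the target labels here, which is why the metric is called zero-shot: only the choice of class prompts connects the pretrained encoders to the benchmark.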

Abstract

Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. The VLMs provide stronger and more generalizable feature embeddings compared to those from ImageNet-pretrained models, thanks to the training on the large-scale Internet image-text pairs. However, despite the amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although pure transformer proves its effectiveness in the text encoding area, it remains questionable whether it is also the case for image encoding, especially considering that various types of networks are proposed on the ImageNet benchmark, which, unfortunately, are rarely studied in VLMs. Due to small data/model scale, the original conclusions of model design on ImageNet can be limited and biased. In this paper, we aim at building an evaluation protocol of vision…

Taxonomy

Topics: Robotics and Automated Systems · Geographic Information Systems Studies