ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

TL;DR
This paper introduces ViTamin, a new vision model designed for vision-language models, demonstrating superior zero-shot performance and scalability compared to traditional ViTs within the CLIP framework.
Contribution
Proposes ViTamin, a scalable vision model optimized for vision-language tasks, with extensive benchmarking showing significant improvements over standard ViTs.
Findings
ViTamin-L improves ImageNet zero-shot accuracy by 2.0% over ViT-L when both are trained on the same DataComp-1B data with the same OpenCLIP training scheme.
ViTamin-XL achieves 82.9% ImageNet zero-shot accuracy with 436M parameters.
ViTamin outperforms larger models like EVA-E on multiple benchmarks.
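The zero-shot accuracy figures above come from CLIP-style evaluation, where an image is assigned to the class whose text embedding is closest to the image embedding. The following is a minimal sketch of that metric only, not the authors' evaluation code. The function names and the toy two-dimensional embeddings are made up for illustration; in practice the embeddings would come from the image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)

def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the ground-truth label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)

# Hypothetical toy data: two classes, three images (values are invented).
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

accuracy = zero_shot_accuracy(image_embs, labels, class_text_embs)
print(f"zero-shot accuracy: {accuracy:.3f}")
```

In this toy run the third image is closer to class 0 than to its labeled class 1, so two of the three predictions are correct. No classifier is trained on the target labels, which is what makes the evaluation zero-shot.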
Abstract
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. The VLMs provide stronger and more generalizable feature embeddings compared to those from ImageNet-pretrained models, thanks to the training on the large-scale Internet image-text pairs. However, despite the amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although pure transformer proves its effectiveness in the text encoding area, it remains questionable whether it is also the case for image encoding, especially considering that various types of networks are proposed on the ImageNet benchmark, which, unfortunately, are rarely studied in VLMs. Due to small data/model scale, the original conclusions of model design on ImageNet can be limited and biased. In this paper, we aim at building an evaluation protocol of vision…
Code & Models
- 🤗 jienengchen/ViTamin-XL-384px · model · 23 dl · ♡ 20
- 🤗 jienengchen/ViTamin-L-336px · model · 11 dl · ♡ 4
- 🤗 jienengchen/ViTamin-B-LTT · model · 7 dl
- 🤗 jienengchen/ViTamin-B · model · 9 dl
- 🤗 jienengchen/ViTamin-S · model · 13 dl
- 🤗 jienengchen/ViTamin-S-LTT · model · 11 dl
- 🤗 jienengchen/ViTamin-L2-384px · model · 14 dl
- 🤗 jienengchen/ViTamin-L2-336px · model · 12 dl
- 🤗 jienengchen/ViTamin-L2-256px · model · 9 dl
- 🤗 jienengchen/ViTamin-XL-256px · model · 9 dl · ♡ 1
Taxonomy
Topics: Robotics and Automated Systems · Geographic Information Systems Studies
