CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

TL;DR
CoCa is a unified image-text foundation model trained jointly with contrastive and captioning losses. It achieves state-of-the-art results across diverse vision and multimodal tasks with zero-shot transfer or minimal task-specific adaptation.
Contribution
This paper presents CoCa, a novel minimalist encoder-decoder model that unifies contrastive and generative training for versatile image-text representations.
Findings
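Achieves 86.3% zero-shot top-1 accuracy on ImageNet
Sets a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder
Performs well on a broad range of downstream tasks spanning visual recognition, crossmodal retrieval, multimodal understanding, and image captioning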
Abstract
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning…
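The design described in the abstract can be sketched in a few lines. The following is a minimal, illustrative sketch in plain Python, not the authors' implementation: the function names, the temperature value, and the loss weights `w_con` and `w_cap` are all assumptions introduced here, and the contrastive term is written as a standard symmetric softmax (InfoNCE-style) loss over paired embeddings. It shows (1) which decoder layers cross-attend to the image encoder and (2) how a contrastive loss on unimodal embeddings can be combined with a captioning (next-token cross-entropy) loss.

```python
import math


def decoder_layer_plan(n_layers):
    """Flag per decoder layer: True if it cross-attends to the image encoder.

    The first half is unimodal (text only, no cross-attention); the
    remaining layers are multimodal, as the abstract describes.
    """
    n_unimodal = n_layers // 2
    return [i >= n_unimodal for i in range(n_layers)]


def _log_softmax(row):
    # Numerically stable log-softmax over a list of floats.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row k of image_embs is assumed to match row k of text_embs.
    The temperature value is an illustrative assumption.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same thing over the transposed similarity matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(cols[k])[k] for k in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over the caption positions."""
    losses = [-_log_softmax(row)[t] for row, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def joint_loss(image_embs, text_embs, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In this sketch, `image_embs` and `text_embs` stand for the unimodal embeddings (the text side taken from the cross-attention-free first half of the decoder), while `token_logits` stands for the output of the multimodal second half. Both losses can therefore be computed from one forward pass through the decoder.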
Code & Models
- 🤗 google/magenta-realtime (model) · 261 dl · ♡ 545
- 🤗 google/videoprism-base-f16r288 (model) · 17k dl · ♡ 98
- 🤗 google/videoprism-large-f8r288 (model) · 813 dl · ♡ 18
- 🤗 google/videoprism-lvt-base-f16r288 (model) · 17k dl · ♡ 11
- 🤗 google/videoprism-lvt-large-f8r288 (model) · 2.5k dl · ♡ 15
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training · Simple Visual Language Model
