CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu; Zirui Wang; Vijay Vasudevan; Legg Yeung; Mojtaba Seyedhosseini; Yonghui Wu

arXiv:2205.01917·cs.CV·June 15, 2022·515 cites

Open Access · 5 Repos · 5 Models · 1 Dataset

TL;DR

CoCa is a unified image-text foundation model trained jointly with contrastive and captioning losses. It achieves state-of-the-art results across diverse vision and multimodal tasks with zero-shot transfer or minimal task-specific adaptation.

Contribution

This paper presents CoCa, a novel minimalist encoder-decoder model that unifies contrastive and generative training for versatile image-text representations.
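A minimal sketch of what "unifying contrastive and generative training" can look like as a single objective, written in plain Python for readability. The names `contrastive_loss`, `captioning_loss`, `coca_style_loss`, the weights `w_con` and `w_cap`, and the temperature value are illustrative assumptions, not taken from this page; the contrastive term uses the usual symmetric softmax form over a batch of paired embeddings.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image-to-text plus text-to-image cross-entropy over a batch.

    Pair (image i, text i) is the positive; every other pairing is a negative.
    The temperature default is an assumption for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sims[i][j]: cosine similarity of image i and text j, divided by temperature
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]
    i2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax([sims[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target tokens (teacher forcing)."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)


def coca_style_loss(image_embs, text_embs, token_logits, target_ids,
                    w_con=1.0, w_cap=1.0):
    """Weighted sum of the two terms; w_con and w_cap are hypothetical knobs."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairing is shuffled; the captioning term is an ordinary next-token cross-entropy. Both are computed from one forward pass in this sketch, which is the sense in which the two training signals share a model.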

Findings

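01. Achieves 86.3% zero-shot top-1 accuracy on ImageNet

02. Sets a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder

03. Transfers to a broad range of downstream tasks spanning visual recognition, crossmodal retrieval, multimodal understanding, and image captioning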

Abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning…
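The decoder split described in the abstract can be sketched as a simple layer plan. This is an illustration under stated assumptions, not the authors' implementation: `DecoderLayerSpec`, `plan_decoder_layers`, and `split_outputs` are hypothetical names, the sketch assumes an even layer count, and it assumes the captioning logits are read after the last cross-attending layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderLayerSpec:
    index: int
    cross_attends_to_image: bool  # False means a unimodal, text-only layer


def plan_decoder_layers(num_layers):
    """First half: self-attention only. Second half: adds cross-attention."""
    if num_layers < 2 or num_layers % 2:
        raise ValueError("this sketch assumes an even number of layers (>= 2)")
    half = num_layers // 2
    return [DecoderLayerSpec(i, cross_attends_to_image=(i >= half))
            for i in range(num_layers)]


def split_outputs(plan):
    """Layer indices where each representation is read off in this sketch."""
    unimodal = [s.index for s in plan if not s.cross_attends_to_image]
    multimodal = [s.index for s in plan if s.cross_attends_to_image]
    return {
        # unimodal text embedding, used by the contrastive loss
        "text_embedding_after_layer": unimodal[-1],
        # multimodal output (assumed source of caption logits)
        "caption_logits_after_layer": multimodal[-1],
    }
```

For example, `plan_decoder_layers(12)` marks layers 0 to 5 as text-only and layers 6 to 11 as cross-attending, so the text embedding never sees image features while the later layers do.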

Code & Models

Datasets

SNUMPR/DRAKE · dataset · 430 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Language-Image Pre-training · Simple Visual Language Model