How Much Can CLIP Benefit Vision-and-Language Tasks?

Sheng Shen; Liunian Harold Li; Hao Tan; Mohit Bansal; Anna Rohrbach; Kai-Wei Chang; Zhewei Yao; Kurt Keutzer

arXiv:2107.06383 · cs.CV · July 15, 2021 · 153 cites

TL;DR

This paper investigates using CLIP, a large-scale pre-trained vision-and-language model, as the visual encoder in various V&L tasks, and reports significant performance improvements, including state-of-the-art results on VQA, Visual Entailment, and V&L Navigation.

Contribution

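The paper systematically evaluates CLIP as the visual encoder in V&L models in two scenarios, task-specific fine-tuning and V&L pre-training followed by transfer to downstream tasks, and shows that it outperforms widely used visual encoders trained on in-domain annotated data, such as BottomUp-TopDown.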

Findings

01. CLIP as a visual encoder outperforms widely used encoders trained on in-domain annotated data, such as BottomUp-TopDown.

02. CLIP-based models achieve new state-of-the-art results on VQA, Visual Entailment, and V&L Navigation.

03. Using CLIP significantly improves zero-shot and transfer learning performance.

Abstract

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We…

Code & Models

Videos

How Much Can CLIP Benefit Vision-and-Language Tasks? · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems