CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Shentong Mo; Jingfei Xia; Ihor Markevych

arXiv:2304.04399 · cs.CV · April 11, 2023 · 1 citation

TL;DR

CAVL combines a pair-wise contrastive pre-training loss with lightweight adaptation networks at fine-tuning time to improve vision-language representations. It reports state-of-the-art results on six downstream tasks while cutting fine-tuning time by up to 76.17%.

Contribution

The paper proposes a novel contrastive pre-training method combined with lightweight adaptation networks for efficient vision-language learning.
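The summary above does not describe the internals of the adaptation networks, so the following is only a minimal sketch of one common form of lightweight adaptation: a residual bottleneck adapter placed next to a frozen layer. The class name `BottleneckAdapter`, the dimensions, and the zero initialisation are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter (not CAVL's actual module).

    Projects hidden states down to a small dimension, applies ReLU,
    projects back up, and adds the result to the input.
    """

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection: small random weights.
        self.w_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        # Up-projection: zero-initialised, so the adapter starts as identity.
        self.w_up = np.zeros((bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def __call__(self, h):
        # h has shape (batch, hidden_dim).
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)
        return h + z @ self.w_up + self.b_up

    def num_params(self):
        params = (self.w_down, self.b_down, self.w_up, self.b_up)
        return sum(p.size for p in params)


adapter = BottleneckAdapter(hidden_dim=768, bottleneck_dim=64)
hidden = np.random.default_rng(1).normal(size=(4, 768))
adapted = adapter(hidden)
dense_layer_params = 768 * 768 + 768  # one full hidden-to-hidden layer
```

Under these assumed sizes, only the adapter's parameters would be trained during fine-tuning, and they amount to a fraction of a single full dense layer of the same width. That is the general mechanism by which such modules reduce trainable parameters.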

Findings

01. Achieves state-of-the-art performance on six downstream tasks.

02. Reduces fine-tuning time by up to 76.17%.

03. Demonstrates the effectiveness of contrastive pre-training and adaptive fine-tuning.

Abstract

Visual and linguistic pre-training aims to learn vision and language representations jointly so that they can be transferred to visual-linguistic downstream tasks. However, semantic confusion between language and vision arises during the pre-training stage. Moreover, current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks. In this work, we present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn alignments between the whole sentence and each image in the same batch during pre-training. At the fine-tuning stage, we introduce two lightweight adaptation networks that reduce the number of model parameters and increase training speed, saving computational resources. We evaluate our CAVL on…
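The abstract describes a pair-wise contrastive loss that aligns each whole sentence with the images in the same batch. The exact formulation is not given here, so the sketch below uses a standard symmetric InfoNCE-style loss as a stand-in. The function name `pairwise_contrastive_loss`, the cosine-similarity scoring, and the `temperature` value are assumptions for illustration, not the paper's definition.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def pairwise_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss (assumed form, not CAVL's exact loss).

    text_emb, image_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same sentence-image pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # logits[i, j] scores sentence i against image j in the batch.
    logits = (t @ v.T) / temperature
    idx = np.arange(len(t))
    # Each sentence should pick out its own image, and vice versa.
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2i + loss_i2t)


rng = np.random.default_rng(0)
sentences = rng.normal(size=(8, 32))
images_matched = sentences + 0.01 * rng.normal(size=(8, 32))
images_shuffled = np.roll(images_matched, shift=1, axis=0)

loss_matched = pairwise_contrastive_loss(sentences, images_matched)
loss_shuffled = pairwise_contrastive_loss(sentences, images_shuffled)
```

In this toy setup the matched pairs yield a much lower loss than the shuffled pairs, which is the behaviour a contrastive alignment objective is meant to reward: the other images in the batch act as negatives for each sentence.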

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings