CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo, Jingfei Xia, Ihor Markevych

TL;DR
CAVL combines a pair-wise contrastive pre-training loss with lightweight adaptation networks for fine-tuning, reaching state-of-the-art results on six downstream tasks while cutting fine-tuning time by up to 76.17%.
Contribution
The paper proposes a novel contrastive pre-training method combined with lightweight adaptation networks for efficient vision-language learning.
Findings
Achieves state-of-the-art performance on six downstream tasks.
Reduces fine-tuning time by up to 76.17%.
Demonstrates the effectiveness of contrastive pre-training and adaptive fine-tuning.
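To make the adaptive fine-tuning idea concrete, here is a minimal sketch of one common lightweight design: a residual bottleneck adapter that is trained while the pre-trained backbone stays frozen. This summary does not describe the architecture of CAVL's two adaptation networks, so the bottleneck layout, the zero initialisation, the class name `BottleneckAdapter`, and the sizes used below are all assumptions chosen for illustration, not the authors' method.

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical residual adapter: hidden -> bottleneck -> hidden, plus skip."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into a small bottleneck (assumed design, not from the paper).
        self.w_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        # Zero-initialised up-projection, so the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def __call__(self, h):
        # ReLU bottleneck followed by a residual connection back to the input.
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)
        return h + z @ self.w_up + self.b_up

    def num_params(self):
        # Only these parameters would be trained if the backbone is frozen.
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


# Illustrative sizes only: a 768-wide hidden state with a 64-wide bottleneck.
adapter = BottleneckAdapter(hidden_dim=768, bottleneck_dim=64)
hidden_states = np.ones((2, 768))
adapted = adapter(hidden_states)
dense_layer_params = 768 * 768 + 768  # one full dense layer, for comparison
```

Comparing `adapter.num_params()` with `dense_layer_params` shows why a design of this kind trains far fewer weights than updating a full layer, which is the general mechanism by which adapter-style fine-tuning saves computation.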
Abstract
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks. However, there exists semantic confusion between language and vision during the pre-training stage. Moreover, current pre-trained models tend to take lots of computation resources for fine-tuning when transferred to downstream tasks. In this work, we present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL. Specifically, we introduce a pair-wise contrastive loss to learn alignments between the whole sentence and each image in the same batch during the pre-training process. At the fine-tuning stage, we introduce two lightweight adaptation networks to reduce model parameters and increase training speed for saving computation resources. We evaluate our CAVL on…
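The pair-wise contrastive loss described in the abstract aligns each whole sentence with the images in the same batch. The sketch below shows one standard way to write such a loss: a symmetric InfoNCE-style objective over a batch similarity matrix. The exact form of CAVL's loss is not given in this summary, so the symmetric formulation, the cosine similarity, the function name `pairwise_contrastive_loss`, and the `temperature` value are assumptions.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def pairwise_contrastive_loss(text_emb, image_emb, temperature=0.1):
    """Assumed symmetric InfoNCE loss over a batch of (sentence, image) pairs.

    text_emb, image_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # logits[i, j] = similarity between sentence i and image j in the batch.
    logits = (t @ v.T) / temperature
    idx = np.arange(len(t))
    # Sentence-to-image: each sentence should pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Image-to-sentence: each image should pick its own sentence among all sentences.
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2i + loss_i2t)


# Toy check: matched pairs should score a lower loss than mismatched ones.
emb = np.eye(4)
loss_aligned = pairwise_contrastive_loss(emb, emb)
loss_shuffled = pairwise_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy batch the aligned pairs give a loss near zero, while rolling the image embeddings by one position (so every sentence is paired with the wrong image) gives a much larger loss. The other images in the batch act as negatives, which is what pushes matched pairs together and mismatched pairs apart.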
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
