Semantic Compositions Enhance Vision-Language Contrastive Learning

Maxwell Aladago; Lorenzo Torresani; Soroush Vosoughi

arXiv:2407.01408 · cs.CV · July 2, 2024

TL;DR

This paper introduces a simple yet effective method called CLIP-C that creates semantically composite image-caption pairs during pretraining, significantly enhancing zero-shot classification and retrieval in vision-language models without extra computational costs.

Contribution

The paper proposes a novel data augmentation technique for vision-language contrastive learning that improves model performance by creating composite examples inspired by CutMix.
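
As a rough illustration of a CutMix-style composite for image-caption pairs, the sketch below keeps the left half of one image and the right half of another, then joins the two captions. The abstract only says that the captions are fused and that 50% of each image is blended; the left/right split, the `" and "` joiner, and the helper name `make_composite` are assumptions made for this example and may differ from the paper's actual procedure.

```python
import numpy as np


def make_composite(img_a, img_b, cap_a, cap_b, joiner=" and "):
    """Hypothetical helper: build one composite image-caption pair.

    Assumes both images are arrays of identical shape (H, W, C).
    The left/right split and the caption joiner are illustrative choices.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must share the same shape")
    mid = img_a.shape[1] // 2          # split along the width axis
    composite = img_a.copy()           # left half comes from img_a
    composite[:, mid:] = img_b[:, mid:]  # right half comes from img_b
    return composite, cap_a + joiner + cap_b


# Toy inputs: an all-black and an all-white 4x4 RGB image.
img_a = np.zeros((4, 4, 3), dtype=np.uint8)
img_b = np.full((4, 4, 3), 255, dtype=np.uint8)
image, caption = make_composite(img_a, img_b, "a dog on grass", "a red car")
```

Because the composite replaces an existing sample rather than adding one, a scheme like this would leave the batch size unchanged.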

Findings

01. Significant improvement in zero-shot classification accuracy.

02. Enhanced cross-modal retrieval performance.

03. Most beneficial in limited data scenarios.

Abstract

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional…
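
The contrastive objective described at the start of the abstract, in which matched image-caption pairs are positives and all other in-batch pairings are negatives, can be sketched as a symmetric cross-entropy over a similarity matrix. This is a minimal NumPy sketch of a generic CLIP-style loss, not the paper's training code; the function name `clip_contrastive_loss` and the fixed `temperature=0.07` are assumptions for illustration.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss (illustrative sketch).

    Row i of image_emb and row i of text_emb form the positive pair;
    every other pairing in the batch acts as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct class for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


# Toy check: aligned embeddings should score a lower loss than misaligned ones.
emb = np.eye(4)
matched = clip_contrastive_loss(emb, emb)
mismatched = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

Under this formulation a composite sample would simply occupy one row of the batch, with its fused caption as the positive for its blended image.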

Taxonomy

Topics: Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training · CutMix