TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Maitreya Patel; Abhiram Kusumba; Sheng Cheng; Changhoon Kim; Tejas Gokhale; Chitta Baral; Yezhou Yang

arXiv:2411.02545·cs.CV·November 6, 2024

TL;DR

TripletCLIP enhances CLIP's compositional reasoning by generating synthetic hard negatives, significantly improving performance on compositional benchmarks and zero-shot tasks.

Contribution

The paper introduces a novel contrastive pre-training strategy using synthetic negative images and captions to improve CLIP's compositional reasoning abilities.

Findings

01 Over 9% improvement on the SugarCrepe benchmark

02 Enhanced zero-shot image classification performance

03 Improved image retrieval results

Abstract

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute…
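To make the idea of training with hard negative captions and images more concrete, the following is a minimal NumPy sketch of a contrastive loss over (image, caption, negative image, negative caption) embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the choice to sum two symmetric terms (one anchored on the real pairs with negative captions as extra distractors, one anchored on the synthetic negative pairs with the original captions as distractors) are one plausible reading of "alternating fashion" in the abstract.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_with_hard_negative_text(img, txt, neg_txt, temperature=0.07):
    """CLIP-style loss where each image also sees hard negative captions.

    img, txt, neg_txt: arrays of shape (n, d); row i of each belongs together.
    The temperature is an assumed constant, not a value taken from the paper.
    """
    img, txt, neg_txt = map(l2_normalize, (img, txt, neg_txt))
    n = img.shape[0]
    idx = np.arange(n)
    # Image -> text: candidates are all n captions plus all n negative captions.
    logits_i2t = img @ np.concatenate([txt, neg_txt], axis=0).T / temperature
    # Text -> image: standard in-batch candidates only.
    logits_t2i = txt @ img.T / temperature
    loss_i2t = -log_softmax(logits_i2t)[idx, idx].mean()
    loss_t2i = -log_softmax(logits_t2i)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def triplet_style_loss(img, txt, neg_img, neg_txt, temperature=0.07):
    # Assumed combination: swap the roles of the real and synthetic pairs
    # so that each pair serves as the hard negative for the other.
    return (contrastive_with_hard_negative_text(img, txt, neg_txt, temperature)
            + contrastive_with_hard_negative_text(neg_img, neg_txt, txt, temperature))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img, neg_img = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
    # Toy check: captions that match their images should score a lower loss
    # than captions that match the negative images instead.
    print(triplet_style_loss(img, img, neg_img, neg_img))
    print(triplet_style_loss(img, neg_img, neg_img, img))
```

In a real training loop the four inputs would come from the image and text encoders, and the negative captions and images would be produced by the language model and text-to-image generator described in the abstract; the random arrays here are placeholders only.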

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training