The Hard Positive Truth about Vision-Language Compositionality

Amita Kamath; Cheng-Yu Hsieh; Kai-Wei Chang; Ranjay Krishna

arXiv:2409.17958 · cs.CL · September 27, 2024

TL;DR

This paper critically evaluates vision-language models' compositionality, revealing that current benchmarks and finetuning methods overstate improvements, and proposes a new training approach with hard positives and negatives for better robustness.

Contribution

The paper shows that reported improvements in compositionality are overestimated, and it introduces a large-scale training dataset containing both hard positives and hard negatives to make finetuned models more robust.

Findings

01

Including hard positives in the evaluation decreases CLIP's performance by 12.9%.

02

Humans perform at 99% accuracy on the same task.

03

Training with both hard positives and negatives improves robustness and benchmark performance.

Abstract

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then…
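To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how adding hard positives can lower a measured score. It assumes one plausible scoring rule, which the abstract does not spell out: an example counts as correct only if both the original caption and its hard positive score above the hard negative. The function names and the similarity scores are hypothetical and purely illustrative; they are not taken from the paper or its repository.

```python
from typing import Iterable, List, Tuple

# Each example holds image-text similarity scores for three texts:
# (original caption, hard positive, hard negative).
# All numbers below are made up for illustration.
Example = Tuple[float, float, float]


def original_correct(s_caption: float, s_hard_neg: float) -> bool:
    """Standard benchmark check: the caption must beat the hard negative."""
    return s_caption > s_hard_neg


def augmented_correct(s_caption: float, s_hard_pos: float, s_hard_neg: float) -> bool:
    """Assumed stricter check: caption AND hard positive must beat the hard negative."""
    return s_caption > s_hard_neg and s_hard_pos > s_hard_neg


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags_list: List[bool] = list(flags)
    return sum(flags_list) / len(flags_list) if flags_list else 0.0


examples: List[Example] = [
    (0.31, 0.29, 0.25),  # caption and hard positive both beat the negative
    (0.30, 0.22, 0.26),  # caption wins, but the hard positive is scored too low
    (0.24, 0.23, 0.27),  # the hard negative wins outright
    (0.33, 0.32, 0.28),  # caption and hard positive both beat the negative
]

orig_acc = accuracy(original_correct(c, n) for c, _, n in examples)
aug_acc = accuracy(augmented_correct(c, p, n) for c, p, n in examples)

print(f"original accuracy:  {orig_acc:.2f}")
print(f"augmented accuracy: {aug_acc:.2f}")
```

In this toy data, the second example passes the standard check but fails the stricter one because the model is not invariant to the hard positive. That is the kind of gap the abstract describes existing benchmarks as missing.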

Code & Models

Repositories

amitakamath/hard_positives (PyTorch, official)

Taxonomy

Topics: Historical and Linguistic Studies

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training