The Hard Positive Truth about Vision-Language Compositionality
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna

TL;DR
This paper shows that current benchmarks and hard-negative finetuning methods overstate improvements in vision-language compositionality, because they do not test whether models remain invariant to hard positives. It proposes training with both hard positives and hard negatives for better robustness.
Contribution
It shows that reported compositionality improvements are overstated, and introduces a large-scale training dataset with hard positives and hard negatives to improve model robustness.
Findings
Including hard positives in the evaluation decreases CLIP's performance by 12.9%; CLIP finetuned with hard negatives drops by up to 38.7%.
Humans perform at 99% accuracy on the same task.
Training with both hard positives and negatives improves robustness and benchmark performance.
Abstract
Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then…
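To make the evaluation idea in the abstract concrete, here is a minimal sketch of how adding hard positives can lower a measured score. It rests on an assumption not stated in the text above: that a model counts as correct on an example only if it scores both the original caption and the hard positive above the hard negative. The `Example` class, both function names, and all similarity scores are hypothetical toy values, not data or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical precomputed image-text similarity scores for one image.
    s_original: float       # score(image, original caption)
    s_hard_negative: float  # score(image, caption edited to change its meaning)
    s_hard_positive: float  # score(image, caption reworded, meaning preserved)


def standard_accuracy(examples):
    """Fraction of examples where the original caption beats the hard negative."""
    correct = sum(e.s_original > e.s_hard_negative for e in examples)
    return correct / len(examples)


def augmented_accuracy(examples):
    """Assumed stricter criterion: the original caption AND the hard positive
    must both score above the hard negative."""
    correct = sum(
        e.s_original > e.s_hard_negative and e.s_hard_positive > e.s_hard_negative
        for e in examples
    )
    return correct / len(examples)


# Toy scores chosen only to illustrate the gap between the two metrics.
examples = [
    Example(0.31, 0.25, 0.29),  # passes both criteria
    Example(0.30, 0.27, 0.24),  # passes the standard test, fails on the hard positive
    Example(0.22, 0.26, 0.28),  # fails the standard test
    Example(0.35, 0.20, 0.33),  # passes both criteria
]

print(standard_accuracy(examples))   # 0.75
print(augmented_accuracy(examples))  # 0.5
```

In this toy set the second example is the interesting case: the model rejects the hard negative but also rejects a caption that means the same thing as the original, which a hard-negative-only benchmark would not detect.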
Taxonomy
Topics: Historical and Linguistic Studies
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
