CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

TL;DR
This paper introduces CREPE, a new benchmark to evaluate the compositional reasoning abilities of vision-language models, revealing significant struggles in systematicity and productivity across various architectures and datasets.
Contribution
The paper presents CREPE, the first comprehensive benchmark for assessing compositionality in vision-language models, highlighting their limitations in systematicity and productivity.
Findings
Models' performance drops with novel compositions, reducing Recall@1 by up to 12%.
Retrieval success declines as complexity increases, nearing random chance at high complexity.
Results are consistent across different models and training datasets.
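The findings above are stated in terms of Recall@1 and random chance. As a minimal sketch of what those two quantities mean in a retrieval setup with hard negatives, the snippet below assumes each image is scored against one correct caption plus a fixed number of hard negative captions, with the correct caption at index 0. The function names and the toy scores are hypothetical and are not taken from the CREPE codebase.

```python
def recall_at_1(score_rows, correct_index=0):
    """Fraction of queries whose correct caption scores strictly highest.

    score_rows: one list of similarity scores per image, covering the
    correct caption and its hard negatives (assumed layout, see lead-in).
    """
    hits = 0
    for scores in score_rows:
        correct = scores[correct_index]
        others = [s for i, s in enumerate(scores) if i != correct_index]
        # Ties count as misses, so a model that scores every caption
        # identically gets no credit.
        if all(correct > s for s in others):
            hits += 1
    return hits / len(score_rows)


def chance_recall_at_1(num_candidates):
    """Expected Recall@1 when one of num_candidates captions is picked uniformly."""
    return 1.0 / num_candidates


# Toy scores (made up for illustration): 4 images, 3 candidate captions each.
toy_scores = [
    [0.9, 0.2, 0.1],   # correct caption wins
    [0.3, 0.8, 0.1],   # a hard negative wins
    [0.5, 0.4, 0.45],  # correct caption wins narrowly
    [0.2, 0.2, 0.1],   # tie with a hard negative, counted as a miss
]

model_recall = recall_at_1(toy_scores)
chance_level = chance_recall_at_1(len(toy_scores[0]))
```

Comparing `model_recall` against `chance_level` is one way to read the claim that retrieval nears random chance as complexity grows: the gap between the two numbers shrinks toward zero.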
Abstract
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that: across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate , , and hard negative captions for a subset of the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Adam · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Layer Normalization · Dropout
