CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Zixian Ma; Jerry Hong; Mustafa Omer Gul; Mona Gandhi; Irena Gao; Ranjay Krishna

arXiv:2212.07796·cs.CL·May 17, 2023

TL;DR

This paper introduces CREPE, a benchmark for evaluating compositional reasoning in vision-language models. It finds that models struggle with both systematicity and productivity across architectures and training datasets.

Contribution

The paper presents CREPE, the first comprehensive benchmark for assessing compositionality in vision-language models, highlighting their limitations in systematicity and productivity.

Findings

01

Models' performance drops with novel compositions, reducing Recall@1 by up to 12%.

02

Retrieval success declines as complexity increases, nearing random chance at high complexity.

03

Results are consistent across different models and training datasets.
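
The findings above are stated in terms of Recall@1 and random chance. As a minimal sketch of what those quantities mean, the snippet below scores a toy retrieval task in which each image is compared against one ground-truth caption plus several hard negatives. The function names, the toy similarity scores, and the convention that the ground-truth caption sits at index 0 are assumptions made for illustration; they are not taken from the CREPE codebase.

```python
from typing import Sequence


def recall_at_1(score_rows: Sequence[Sequence[float]], positive_index: int = 0) -> float:
    """Fraction of queries whose top-scoring candidate is the ground-truth one.

    Each row holds similarity scores for one image against its candidate
    captions. By assumption (illustrative only), the ground-truth caption is
    at `positive_index` and the other entries are hard negatives.
    """
    if not score_rows:
        raise ValueError("need at least one query")
    hits = 0
    for scores in score_rows:
        # Index of the highest-scoring caption; ties resolve to the first one.
        best = max(range(len(scores)), key=lambda i: scores[i])
        hits += int(best == positive_index)
    return hits / len(score_rows)


def chance_recall_at_1(num_candidates: int) -> float:
    """Recall@1 expected from picking one candidate uniformly at random."""
    return 1.0 / num_candidates


# Toy scores (hypothetical): 4 images, each with 1 positive + 3 hard negatives.
scores = [
    [0.81, 0.40, 0.35, 0.22],  # positive ranked first: hit
    [0.30, 0.55, 0.10, 0.05],  # a hard negative outscores the positive: miss
    [0.62, 0.61, 0.20, 0.15],  # positive ranked first: hit
    [0.10, 0.20, 0.90, 0.30],  # miss
]

r1 = recall_at_1(scores)
chance = chance_recall_at_1(len(scores[0]))
print(f"Recall@1 = {r1:.2f}, chance = {chance:.2f}")
```

Comparing the measured Recall@1 against the chance level for the same number of candidates is what makes a statement like "nearing random chance" checkable.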

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that: across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the…

Code & Models

Repositories

raivnlab/crepe · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Adam · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Layer Normalization · Dropout