Text encoders bottleneck compositionality in contrastive vision-language models

Amita Kamath; Jack Hessel; Kai-Wei Chang

arXiv:2305.14897·cs.CL·October 31, 2023·1 cites

TL;DR

This paper investigates how well contrastive vision-language models like CLIP capture compositional language information. It finds that their single-vector text encodings lose information on more compositional captions, and it introduces the CompPrompts caption set along with text-only recovery probes to measure that loss.

Contribution

The study introduces CompPrompts, a set of compositional captions, and a method to assess text encoding quality without images, highlighting the bottleneck in VL models' ability to represent compositional language.

Findings

01. CLIP's text encoder struggles with compositional inputs

02. Some text encoders outperform others in capturing compositionality

03. Text recoverability correlates with multi-modal matching performance

Abstract

Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal…
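The recovery idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the encoder is a bag-of-words count vector (not CLIP or any model from the paper), the "probe" is a nearest-neighbor lookup over a fixed candidate list rather than a trained model, and the four captions are invented rather than drawn from CompPrompts. The point is only that when a single vector discards word order, two captions that differ in who does what become indistinguishable, and exact-match recovery drops.

```python
from collections import Counter


def bag_of_words_encode(caption, vocab):
    """Toy stand-in encoder: one count per vocabulary word, so word order is lost."""
    counts = Counter(caption.lower().split())
    return tuple(counts[word] for word in vocab)


def recover(vector, candidates, vocab):
    """Toy stand-in probe: return the first candidate whose encoding is closest to `vector`."""
    def squared_distance(candidate):
        encoded = bag_of_words_encode(candidate, vocab)
        return sum((a - b) ** 2 for a, b in zip(encoded, vector))
    return min(candidates, key=squared_distance)


def exact_match_rate(captions, vocab):
    """Fraction of captions that are recovered exactly from their own single vector."""
    hits = sum(
        recover(bag_of_words_encode(caption, vocab), captions, vocab) == caption
        for caption in captions
    )
    return hits / len(captions)


# Hypothetical captions of increasing compositionality (not taken from CompPrompts).
captions = [
    "a dog",
    "a brown dog",
    "a dog chases a cat",
    "a cat chases a dog",
]
vocab = sorted({word for caption in captions for word in caption.lower().split()})

rate = exact_match_rate(captions, vocab)
print(f"exact-match recovery: {rate:.2f}")
```

In this toy setup the two relational captions map to the same vector, so only one of them can be recovered. A trained probe over a real text encoder would be evaluated in the same spirit: encode, attempt to reconstruct, and compare against the original caption.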

Code & Models

Repositories

amitakamath/vl_text_encoders_are_bottlenecks (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Test · Contrastive Language-Image Pre-training