Text encoders bottleneck compositionality in contrastive vision-language models
Amita Kamath, Jack Hessel, Kai-Wei Chang

TL;DR
This paper investigates how much compositional language information contrastive vision-language models like CLIP retain when they compress a caption into a single vector. It shows that their text encoders lose information on complex, structured captions, and it introduces a new caption set and a text-only probing analysis to measure that loss.
Contribution
The study introduces CompPrompts, a set of increasingly compositional image captions, and text-only recovery probes that try to reconstruct a caption from its single-vector representation. Because the probes need no images, they isolate how much compositional information the text encoder itself loses.
Findings
CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations
Some text encoders perform significantly better than others
Text recoverability correlates with multi-modal matching performance
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal…
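The recoverability idea in the abstract can be illustrated with a toy sketch. This is not the authors' probe, which is trained on representations from real VL text encoders. Here a deliberately order-insensitive encoder (normalized word counts) stands in for the single-vector bottleneck, and a nearest-neighbor lookup over candidate captions stands in for the recovery probe. All function names (build_vocab, encode, cosine, recover, exact_match_rate) and the example captions are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def build_vocab(captions):
    """Sorted list of every word that appears in the captions."""
    return sorted({w for c in captions for w in c.lower().split()})


def encode(caption, vocab):
    """Toy single-vector encoder (an assumption, not CLIP):
    L2-normalized word counts, so word order is discarded."""
    counts = Counter(caption.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def recover(embedding, candidates, vocab):
    """Toy stand-in for a recovery probe: pick the candidate caption
    whose encoding is most similar to the given embedding."""
    return max(candidates, key=lambda c: cosine(encode(c, vocab), embedding))


def exact_match_rate(captions, candidates, vocab):
    """Fraction of captions recovered exactly from their own embedding."""
    hits = sum(recover(encode(c, vocab), candidates, vocab) == c for c in captions)
    return hits / len(captions)


# Captions that differ in their words are recoverable from the toy vector.
simple = ["a dog", "a red dog", "two dogs"]
# Captions that differ only in who does what to whom collapse to the
# same vector under this encoder, so recovery cannot tell them apart.
relational = ["a dog chases a cat", "a cat chases a dog"]

vocab = build_vocab(simple + relational)
simple_rate = exact_match_rate(simple, simple, vocab)
relational_rate = exact_match_rate(relational, relational, vocab)
print(simple_rate, relational_rate)
```

In this sketch the two relational captions map to the same vector, so at most one of them can be recovered. That mirrors, in an exaggerated form, the kind of information loss the recovery probes are designed to detect in real text encoders.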
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Test · Contrastive Language-Image Pre-training
