When and why vision-language models behave like bags-of-words, and what to do about it?
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou

TL;DR
This paper introduces a large benchmark to evaluate vision-language models' understanding of relationships, attributes, and order, revealing their deficiencies and proposing a simple training modification to improve compositional understanding.
Contribution
The creation of the comprehensive ARO benchmark and the demonstration that contrastive learning can be enhanced with composition-aware hard negative mining.
Findings
VLMs show poor relational and order understanding.
Training datasets do not sufficiently teach compositionality.
Simple modifications to contrastive learning improve compositional performance.
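The last finding can be made concrete with a small sketch of an image-to-text contrastive loss in which each image also competes against extra hard-negative captions. This is an illustration under stated assumptions, not the authors' implementation: the function names, the similarity-matrix layout, and the temperature value of 0.07 are all hypothetical choices made for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def image_to_text_loss(sim_pos, sim_neg=None, temperature=0.07):
    """Mean cross-entropy of each image against its candidate captions.

    sim_pos[i][j]: similarity of image i and true caption j (N x N).
    sim_neg[i][k]: similarity of image i and hard-negative caption k
                   (optional; hypothetical layout for this sketch).
    The correct caption for image i is assumed to sit at column i.
    """
    n = len(sim_pos)
    total = 0.0
    for i in range(n):
        row = list(sim_pos[i])
        if sim_neg is not None:
            # Hard negatives are appended as extra wrong answers.
            row += list(sim_neg[i])
        logits = [s / temperature for s in row]
        total -= log_softmax(logits)[i]
    return total / n


# Toy similarities: the hard negatives score almost as high as the truth.
sim_pos = [[0.9, 0.1], [0.2, 0.8]]
sim_neg = [[0.85, 0.0], [0.1, 0.75]]
base_loss = image_to_text_loss(sim_pos)
hard_loss = image_to_text_loss(sim_pos, sim_neg)
```

Appending near-miss captions only adds terms to the softmax denominator, so `hard_loss` is larger than `base_loss` on the same batch. That extra penalty is what pushes a model to separate a caption from its word-swapped variant.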
Abstract
Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on…
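To make "order sensitivity" concrete, the sketch below shuffles the words of a caption and checks whether a scorer prefers the original over the shuffled version. It is a simplification under stated assumptions: a plain word shuffle stands in for whatever perturbations the benchmark actually uses, and `bag_of_words_score` is a hypothetical toy scorer that stands in for a real VLM.

```python
import random


def shuffle_words(caption, rng):
    """Return the caption with its words in a different order."""
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # no distinct reordering exists
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def bag_of_words_score(image_tags, caption):
    """Toy scorer: counts caption words that match the image tags.

    It ignores word order entirely, by construction.
    """
    return len(set(image_tags) & set(caption.lower().split()))


def order_accuracy(examples, score_fn, rng):
    """Fraction of examples where the original caption strictly outscores
    its shuffled version."""
    correct = 0
    for image_tags, caption in examples:
        perturbed = shuffle_words(caption, rng)
        if score_fn(image_tags, caption) > score_fn(image_tags, perturbed):
            correct += 1
    return correct / len(examples)


rng = random.Random(0)
examples = [(["horse", "grass"], "the horse is eating the grass")]
accuracy = order_accuracy(examples, bag_of_words_score, rng)
```

Because the toy scorer sees only which words are present, the original and shuffled captions always tie and `accuracy` comes out as 0.0. A model that truly used word order would score the original higher.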
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Test · Contrastive Language-Image Pre-training · Contrastive Learning
