When and why vision-language models behave like bags-of-words, and what to do about it?

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou

arXiv:2210.01936 · cs.CV · March 27, 2023 · 37 cites

TL;DR

This paper introduces a large benchmark to evaluate vision-language models' understanding of relationships, attributes, and order, revealing their deficiencies and proposing a simple training modification to improve compositional understanding.

Contribution

The creation of the comprehensive ARO benchmark and the demonstration that contrastive learning can be enhanced with composition-aware hard negative mining.
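One way to picture composition-aware hard negatives inside a contrastive objective is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the function name `image_to_text_loss`, the plain-list similarity matrices, and the choice to show only the image-to-text direction are all hypothetical. The idea is that each image must pick its own caption out of the other captions in the batch plus perturbed captions that reuse the same words.

```python
import math


def image_to_text_loss(sim_pos, sim_hard_neg):
    """Average cross-entropy of matching each image to its own caption.

    sim_pos[i][j]      -- similarity of image i to original caption j (N x N).
    sim_hard_neg[i][j] -- similarity of image i to a perturbed copy of
                          caption j (N x N), e.g. with swapped words.
    Both matrices are hypothetical inputs; a real model would produce them
    from image and text embeddings.
    """
    n = len(sim_pos)
    total = 0.0
    for i in range(n):
        # Candidate captions: every original caption plus every hard negative.
        logits = list(sim_pos[i]) + list(sim_hard_neg[i])
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # The correct caption for image i sits on the diagonal of sim_pos.
        total += log_denominator - sim_pos[i][i]
    return total / n


sim_pos = [[5.0, 0.0], [0.0, 5.0]]
# Negatives the model already separates well contribute little to the loss.
easy_negatives = [[-5.0, -5.0], [-5.0, -5.0]]
# Perturbed captions that score almost like the originals are penalized.
confusable_negatives = [[4.5, 0.0], [0.0, 4.5]]

loss_easy = image_to_text_loss(sim_pos, easy_negatives)
loss_confusable = image_to_text_loss(sim_pos, confusable_negatives)
```

In this toy setting `loss_confusable` is larger than `loss_easy`, which is the pressure a hard negative is meant to create: a model that scores a reordered caption nearly as high as the true one pays for it during training.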

Findings

01 VLMs show poor relational and order understanding.

02 Training datasets do not sufficiently teach compositionality.

03 Simple modifications to contrastive learning improve compositional performance.

Abstract

Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on…
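The order-sensitivity idea can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the helper names (`shuffle_words`, `bag_of_words_score`, `position_aware_score`, `order_accuracy`) are hypothetical, the two captions are made up, and the reference caption stands in for the image that a real VLM would score against. The sketch only shows why a scorer that ignores word order cannot prefer a correct caption over a shuffled copy of it.

```python
import random
from collections import Counter


def shuffle_words(caption, rng):
    """Return the caption with its words in a different order, if possible."""
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # no distinct reordering exists
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def bag_of_words_score(reference, caption):
    """Order-blind scorer: number of words shared, counted with multiplicity."""
    ref, cap = Counter(reference.split()), Counter(caption.split())
    return sum((ref & cap).values())


def position_aware_score(reference, caption):
    """Order-sensitive scorer: number of positions where the words agree."""
    return sum(a == b for a, b in zip(reference.split(), caption.split()))


def order_accuracy(score_fn, captions, seed=0):
    """Fraction of captions scored strictly higher than their shuffled copy."""
    rng = random.Random(seed)
    correct = 0
    for caption in captions:
        perturbed = shuffle_words(caption, rng)
        if score_fn(caption, caption) > score_fn(caption, perturbed):
            correct += 1
    return correct / len(captions)


captions = ["the horse is eating the grass", "a dog chases a cat"]
bow_accuracy = order_accuracy(bag_of_words_score, captions)
positional_accuracy = order_accuracy(position_aware_score, captions)
```

Because shuffling preserves word counts, the order-blind scorer always ties and `bow_accuracy` is 0.0, while the position-aware scorer separates every pair and `positional_accuracy` is 1.0.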

Code & Models

Repositories

mertyg/vision-language-models-are-bows
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Test · Contrastive Language-Image Pre-training · Contrastive Learning