In-Context Learning Improves Compositional Understanding of Vision-Language Models

Matteo Nulli; Anesa Ibrahimi; Avik Pal; Hoshe Lee; Ivona Najdenkoska

arXiv:2407.15487 · cs.CV · July 23, 2024

TL;DR

This paper benchmarks the compositional understanding of vision-language models, comparing contrastive and generative models, and shows that in-context learning improves their performance over baseline models on multiple compositional understanding datasets.

Contribution

It provides a comprehensive benchmark analysis of VLMs' compositional understanding and introduces in-context learning as a method to improve their reasoning abilities.

Findings

01. In-context learning improves compositional understanding in VLMs.

02. Contrastive and generative models show different strengths in compositional tasks.

03. The proposed approach outperforms baseline models on multiple datasets.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this work, we investigate the reasons for such a lack of capability by performing an extensive benchmarking of compositional understanding in VLMs. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Furthermore, we leverage In-Context Learning (ICL) as a way to improve the ability of VLMs to perform more complex reasoning and understanding given an image. Our extensive experiments demonstrate that our proposed approach outperforms baseline models across multiple compositional understanding datasets.
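To make the ICL idea in the abstract concrete, the following is a minimal sketch of how a few-shot prompt for a caption-selection task might be assembled. It is not the authors' implementation: the `Demo` class, the `build_icl_prompt` function, the `<image>` placeholder token, and the A/B answer format are all assumptions made for illustration, and a real generative VLM would receive the images through its own input interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """Hypothetical in-context demonstration: two candidate captions and the correct label."""
    caption_a: str
    caption_b: str
    answer: str  # "A" or "B"


def build_icl_prompt(demos: List[Demo], query_a: str, query_b: str) -> str:
    """Assemble a few-shot text prompt.

    Each "<image>" marks where an image would be supplied to the model;
    the placeholder token itself is an assumption, not a fixed standard.
    """
    parts = []
    for i, demo in enumerate(demos, start=1):
        if demo.answer not in ("A", "B"):
            raise ValueError(f"demo {i}: answer must be 'A' or 'B'")
        # A solved example: image placeholder, both captions, then the label.
        parts.append(
            f"Example {i}: <image>\n"
            f"A: {demo.caption_a}\n"
            f"B: {demo.caption_b}\n"
            f"Answer: {demo.answer}"
        )
    # The query uses the same layout but leaves the answer for the model to fill in.
    parts.append(
        "Query: <image>\n"
        f"A: {query_a}\n"
        f"B: {query_b}\n"
        "Answer:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [Demo("a dog chasing a cat", "a cat chasing a dog", "A")]
    print(build_icl_prompt(demos, "a red cube on a blue ball", "a blue cube on a red ball"))
```

The captions in each pair differ only in how the same objects are arranged, which is the kind of distinction a compositional benchmark probes. The solved demonstrations give the model a worked pattern to follow before it answers the query.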

Code & Models

Repositories

hoezey/vlm-compositionality (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications