BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Imanol Miranda; Ander Salaberria; Eneko Agirre; Gorka Azkune

arXiv:2406.09952 · cs.CV · November 5, 2024

TL;DR

This paper introduces BiVLC, a benchmark dataset that evaluates vision-language compositionality in both image-to-text and text-to-image retrieval. It shows that current multimodal models perform poorly in the text-to-image direction and that contrastive training with synthetic data improves performance.

Contribution

The paper presents BiVLC, a bidirectional benchmark that pairs each synthetic hard negative text with a synthetic hard negative image, and demonstrates that contrastive training with synthetic data improves model performance in both retrieval directions.

Findings

1. Models perform poorly in text-to-image retrieval.

2. Previous conclusions change when considering both retrieval directions.

3. Contrastive training with synthetic data improves performance.

Abstract

Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in…
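
The four retrieval examples per instance described in the abstract can be sketched as comparisons over a 2x2 image-text similarity matrix. The snippet below is a minimal illustration, not the authors' evaluation code: the function names, the matrix layout (index 0 for the original image or text, index 1 for the synthetic hard negative), the per-example accuracy aggregation, and the similarity values are all assumptions made for this example.

```python
from typing import Dict, Sequence

# Assumed layout: sim[i][j] is the model's similarity between image i and
# text j, where index 0 is the original and index 1 is the synthetic hard
# negative generated from it.
Matrix = Sequence[Sequence[float]]


def score_instance(sim: Matrix) -> Dict[str, bool]:
    """Return the outcome of the four retrieval examples for one instance."""
    return {
        # Image-to-text: each image must prefer its own text.
        "i2t_original_image": sim[0][0] > sim[0][1],
        "i2t_negative_image": sim[1][1] > sim[1][0],
        # Text-to-image: each text must prefer its own image.
        "t2i_original_text": sim[0][0] > sim[1][0],
        "t2i_negative_text": sim[1][1] > sim[0][1],
    }


def direction_accuracy(sims: Sequence[Matrix]) -> Dict[str, float]:
    """Fraction of correct examples per direction (hypothetical aggregation)."""
    i2t_hits = t2i_hits = 0
    for sim in sims:
        result = score_instance(sim)
        i2t_hits += result["i2t_original_image"] + result["i2t_negative_image"]
        t2i_hits += result["t2i_original_text"] + result["t2i_negative_text"]
    total = 2 * len(sims)  # two examples per direction per instance
    return {"i2t": i2t_hits / total, "t2i": t2i_hits / total}


# Made-up similarity values for two instances.
example_sims = [
    [[0.9, 0.4], [0.5, 0.8]],
    [[0.7, 0.6], [0.8, 0.9]],
]
print(direction_accuracy(example_sims))
```

In the second made-up instance both images pick the right text, yet the original text scores higher with the synthetic image (0.8) than with its own image (0.7). A benchmark that only tests image-to-text retrieval would not register that failure, which is the gap the text-to-image direction is meant to expose.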

Code & Models

Repositories

imirandam/bivlc (PyTorch, official)

Videos

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection