Image Retrieval from Contextual Descriptions

Benno Krojer; Vaibhav Adlakha; Vibhav Vineet; Yash Goyal; Edoardo Ponti; Siva Reddy

arXiv:2203.15867·cs.CV·November 21, 2022

TL;DR

This paper introduces ImageCoDe, a challenging multimodal benchmark for evaluating vision-and-language models' ability to use contextual cues for image retrieval, revealing significant gaps compared to human performance.

Contribution

The paper presents ImageCoDe, a new and complex challenge for assessing, and helping to improve, models' capacity for grounded language understanding using visual and temporal context.

Findings

01 State-of-the-art models perform significantly worse than humans on ImageCoDe.

02 Models achieve up to 59.4% accuracy on static images, far below human accuracy of 90.8%.

03 Proposed model variants show modest improvements by better incorporating context.

Abstract

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe.…
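The retrieval setup described in the abstract (pick the one correct image out of a set of candidates, given a description) can be illustrated with a minimal bi-encoder-style sketch. This is not the authors' implementation: the function names `cosine`, `retrieve`, and `accuracy` are hypothetical, the embeddings are toy vectors standing in for the outputs of a text encoder and an image encoder, and cosine similarity is assumed as the scoring function.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(text_emb: Sequence[float], image_embs: List[Sequence[float]]) -> int:
    """Return the index of the candidate image that scores highest against the description."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Fraction of descriptions for which the retrieved index matches the target index."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy data (assumption): 10 near-identical candidates that differ only in one
# coordinate, loosely mimicking a set of minimally contrastive images.
candidates = [[1.0, 0.0, 0.1 * i] for i in range(10)]
description = [1.0, 0.0, 0.5]  # stand-in for an encoded contextual description

predicted_index = retrieve(description, candidates)
```

In this toy set the description vector is parallel to the candidate at index 5, so that candidate receives the highest score. A real system would replace the toy vectors with encoder outputs, and a cross-encoder would instead score each (description, image) pair jointly rather than comparing independent embeddings.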

Code & Models

Repositories

mcgill-nlp/imagecode
PyTorch · Official

Datasets

BennoKrojer/ImageCoDe
dataset · 170 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques

Methods: Vision-and-Language BERT · Contrastive Language-Image Pre-training