Image Retrieval from Contextual Descriptions
Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, Siva Reddy

TL;DR
This paper introduces ImageCoDe, a challenging multimodal benchmark for evaluating vision-and-language models' ability to use contextual cues for image retrieval, revealing significant gaps compared to human performance.
Contribution
The paper presents ImageCoDe, a new challenge for assessing and improving how well models ground language in visual and temporal context.
Findings
State-of-the-art models perform significantly worse than humans on ImageCoDe.
Models achieve up to 59.4% accuracy on static images, far below human accuracy of 90.8%.
Proposed model variants show modest improvements by better incorporating context.
Abstract
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe.…
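The retrieval setup described in the abstract can be illustrated with a minimal sketch of bi-encoder-style scoring: a description embedding is compared against each of the 10 candidate image embeddings, and the highest-scoring candidate is returned. This is not the authors' implementation. The function names (`cosine`, `retrieve`, `accuracy`) and the 3-dimensional toy vectors are hypothetical stand-ins for the embeddings a model such as CLIP would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(text_emb, image_embs):
    """Return the index of the candidate most similar to the description."""
    scores = [cosine(text_emb, img) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions, targets):
    """Fraction of examples where the retrieved index equals the target."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy data (assumption): 10 candidates that differ only slightly from one
# another, loosely mimicking a set of minimally contrastive images.
candidates = [[1.0, 0.0, 0.1 * i] for i in range(10)]
description = [1.0, 0.0, 0.72]

predicted = retrieve(description, candidates)
print(predicted)
```

Because the candidates are so close to each other, small differences in the embedding decide the outcome, which hints at why fine-grained contextual cues matter in this task.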
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
Methods: Vision-and-Language BERT · Contrastive Language-Image Pre-training
