Compositional Image-Text Matching and Retrieval by Grounding Entities

Madhukar Reddy Vongala; Saurabh Srivastava; Jana Košecká

arXiv:2505.02278·cs.CV·May 6, 2025

TL;DR

This paper introduces a zero-shot method to improve CLIP's ability to perform entity grounding and compositional image-text matching by augmenting embeddings with localized sub-image information, leading to better retrieval accuracy.

Contribution

The work proposes a novel, learning-free augmentation technique for CLIP embeddings that enhances compositional matching and grounding capabilities without additional training.

Findings

01

Achieves a 1.5% improvement in image-text matching accuracy on the Visual Genome and SVO datasets.

02

Achieves a 12% increase in Recall@1 on the Flickr30K retrieval benchmark.

03

Enhanced embeddings outperform baseline CLIP in zero-shot image-text retrieval tasks.
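
For readers unfamiliar with the Recall@1 metric cited above, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It is an illustration only and not the paper's evaluation code: the function name `recall_at_k`, the toy similarity values, and the assumption of exactly one ground-truth candidate per query are all hypothetical (Flickr30K pairs each image with several captions, which this sketch does not model).

```python
import numpy as np

def recall_at_k(similarity, gt_index, k=1):
    """Fraction of queries whose ground-truth candidate is in the top-k.

    similarity: (num_queries, num_candidates) array of scores.
    gt_index:   (num_queries,) array, index of the correct candidate per query.
    Assumes a single correct candidate per query (a simplification).
    """
    similarity = np.asarray(similarity, dtype=np.float64)
    gt_index = np.asarray(gt_index)
    # Indices of the k highest-scoring candidates for each query.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    # A query is a hit if its ground-truth index appears among its top-k.
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 text queries scored against 3 candidate images.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.5, 0.6, 0.1]])
gt = np.array([0, 2, 0])

r1 = recall_at_k(sim, gt, k=1)  # queries 0 and 1 hit, query 2 misses
r2 = recall_at_k(sim, gt, k=2)  # query 2 now hits as well
```

In this toy case the third query ranks the wrong image first, so it only counts as a hit once K is raised to 2.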

Abstract

Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24}. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. …
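
The idea of adjusting a global image embedding with embeddings of localized sub-images can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract as shown here does not give the adjustment rule, so the fixed-weight blend, the `alpha` parameter, and the function names `augment_image_embedding` and `match_score` are hypothetical. Random vectors stand in for CLIP embeddings of the full image, the detected regions, and the caption.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def augment_image_embedding(global_emb, region_embs, alpha=0.5):
    """Blend a global embedding with the mean of region embeddings.

    Hypothetical rule: (1 - alpha) * global + alpha * mean(regions),
    followed by re-normalization. The paper's actual rule may differ.
    """
    global_emb = l2_normalize(np.asarray(global_emb, dtype=np.float64))
    region_embs = np.asarray(region_embs, dtype=np.float64)
    if region_embs.size == 0:
        # No detections: fall back to the unchanged global embedding.
        return global_emb
    region_mean = l2_normalize(region_embs).mean(axis=0)
    mixed = (1.0 - alpha) * global_emb + alpha * region_mean
    return l2_normalize(mixed)

def match_score(image_emb, text_emb):
    """Cosine similarity between an image and a text embedding."""
    return float(np.dot(l2_normalize(image_emb), l2_normalize(text_emb)))

# Stand-in vectors; in practice these would come from an image/text encoder.
rng = np.random.default_rng(0)
global_emb = rng.normal(size=8)        # embedding of the whole image
region_embs = rng.normal(size=(3, 8))  # embeddings of 3 detected sub-images
text_emb = rng.normal(size=8)          # embedding of the caption

augmented = augment_image_embedding(global_emb, region_embs, alpha=0.5)
score = match_score(augmented, text_emb)
```

Setting `alpha=0.0` recovers the baseline global embedding, which makes it easy to compare the augmented score against the unmodified one on the same image-text pair.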

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training