Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala, Saurabh Srivastava, Jana Košecká

TL;DR
This paper introduces a zero-shot method to improve CLIP's ability to perform entity grounding and compositional image-text matching by augmenting embeddings with localized sub-image information, leading to better retrieval accuracy.
Contribution
The work proposes a novel, learning-free augmentation technique for CLIP embeddings that enhances compositional matching and grounding capabilities without additional training.
Findings
Achieved 1.5% improvement in image-text matching accuracy on Visual Genome and SVO datasets.
Significant 12% increase in Recall@1 on Flickr30K retrieval benchmark.
Enhanced embeddings outperform baseline CLIP in zero-shot image-text retrieval tasks.
Abstract
Vision-language pretraining on large datasets of image-text pairs is one of the main building blocks of current Vision-Language Models. With additional training, these models excel in various downstream tasks, including visual question answering, image captioning, and visual commonsense reasoning. However, a notable weakness of pretrained models like CLIP is their inability to perform entity grounding and compositional image and text matching~\cite{Jiang2024ComCLIP, yang2023amc, Rajabi2023GroundedVSR, learninglocalizeCVPR24}. In this work we propose a novel learning-free zero-shot augmentation of CLIP embeddings that has favorable compositional properties. We compute separate embeddings of sub-images of object entities and relations that are localized by state-of-the-art open-vocabulary detectors and dynamically adjust the baseline global image embedding. % The final…
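The abstract is truncated before it states how the global embedding is adjusted, so the following is only an illustrative sketch of the general idea, not the authors' formula. It assumes the global image embedding, the sub-image embeddings (one per detected entity or relation crop), and the text embedding have already been computed by some CLIP-style encoder. The function names, the blending weight `alpha`, and the weighted-mean combination are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def augment_image_embedding(global_emb, sub_embs, weights=None, alpha=0.5):
    """Blend a global image embedding with localized sub-image embeddings.

    ASSUMPTION: a convex combination of the global embedding and a weighted
    mean of sub-image embeddings. The paper's actual adjustment rule is not
    given in the abstract; `alpha` and `weights` are hypothetical parameters.
    """
    g = l2_normalize(global_emb)
    subs = np.asarray(sub_embs, dtype=np.float64)
    if subs.size == 0:
        # No detections: fall back to the baseline global embedding.
        return g
    subs = l2_normalize(np.atleast_2d(subs))
    if weights is None:
        weights = np.ones(subs.shape[0])
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    # Weighted mean of the localized embeddings, renormalized to unit length.
    local = l2_normalize(w @ subs)
    return l2_normalize((1.0 - alpha) * g + alpha * local)


def match_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    return float(l2_normalize(image_emb) @ l2_normalize(text_emb))


# Toy 3-D vectors standing in for real encoder outputs.
global_emb = np.array([1.0, 0.0, 0.0])
sub_embs = np.array([[0.0, 1.0, 0.0]])   # one detected entity crop
text_emb = np.array([0.0, 1.0, 0.0])

baseline = match_score(global_emb, text_emb)
augmented = match_score(augment_image_embedding(global_emb, sub_embs), text_emb)
```

In this toy setup the text aligns with the detected entity rather than with the global view, so blending in the sub-image embedding raises the matching score over the baseline. With real encoders the effect would depend on detector quality and on how the combination is weighted.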
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
