ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

Guanqi Zhan; Yuanpei Liu; Kai Han; Weidi Xie; Andrew Zisserman

arXiv:2502.15682·cs.CV·October 21, 2025

TL;DR

This paper introduces ELIP, a framework that enhances vision-language models for improved text-to-image retrieval and better out-of-distribution generalization, using visual prompts conditioned on text queries.

Contribution

ELIP is a novel method that boosts existing large-scale vision-language models for retrieval tasks by predicting visual prompts from text queries, adaptable to multiple architectures.
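The prompt-prediction idea can be illustrated with a minimal, untrained sketch. It assumes a two-layer MLP maps the text embedding to a small set of prompt vectors, which are then concatenated with the ViT patch tokens. The layer sizes, the GELU activation, the prepend position, and all function names here (`make_mlp`, `text_to_prompts`, `condition_tokens`) are hypothetical choices for illustration, not details taken from the paper.

```python
import math
import random


def make_mlp(d_in, d_hidden, d_out, seed=0):
    """Randomly initialised two-layer MLP weights (illustrative, untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.02) for _ in range(d_hidden)] for _ in range(d_out)]
    return w1, w2


def linear(w, x):
    """Matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gelu(v):
    """Tanh approximation of GELU (an assumed activation choice)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def text_to_prompts(text_emb, mlp, n_prompts, d_model):
    """Map one text embedding to n_prompts vectors of size d_model."""
    w1, w2 = mlp
    hidden = [gelu(h) for h in linear(w1, text_emb)]
    flat = linear(w2, hidden)  # length n_prompts * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_prompts)]


def condition_tokens(patch_tokens, prompts):
    """Prepend the text-derived prompts to the ViT input tokens (assumed position)."""
    return prompts + patch_tokens


# Toy dimensions: 8-d text embedding, 2 prompts, 4-d ViT tokens, 5 patches.
mlp = make_mlp(d_in=8, d_hidden=16, d_out=2 * 4)
prompts = text_to_prompts([0.1] * 8, mlp, n_prompts=2, d_model=4)
tokens = condition_tokens([[0.0] * 4 for _ in range(5)], prompts)
```

In this sketch the image encoder itself is left untouched; only the small mapping network would need training, which is one plausible reading of why the approach can be adapted to several architectures.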

Findings

01. ELIP significantly improves retrieval performance of CLIP, SigLIP, and SigLIP-2.

02. ELIP outperforms BLIP-2 on several benchmarks.

03. ELIP enhances zero-shot generalization to OOD datasets.

Abstract

The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different…
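A rough sketch of how text-to-image re-ranking with a text-conditioned encoder might be wired up follows. It assumes a two-stage pipeline: a first pass ranks all images by cosine similarity using precomputed embeddings, and only the top-k candidates are re-encoded with the query-conditioned encoder and re-scored. The function names, the `conditioned_encoder` callback, and the choice of `k` are assumptions made for illustration, not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rerank(text_emb, base_image_embs, conditioned_encoder, k):
    """Return image indices, with the top-k re-ordered by conditioned scores.

    conditioned_encoder(image_index, text_emb) is a hypothetical callback that
    returns an image embedding computed with text-derived visual prompts.
    """
    # Stage 1: cheap ranking over the whole gallery with precomputed embeddings.
    scores = [cosine(text_emb, e) for e in base_image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]

    # Stage 2: re-encode only the top-k candidates, conditioned on the query.
    new_scores = {i: cosine(text_emb, conditioned_encoder(i, text_emb)) for i in top}
    top = sorted(top, key=lambda i: new_scores[i], reverse=True)
    return top + rest


# Toy gallery of three images with 2-d embeddings.
query = [1.0, 0.0]
gallery = [[1.0, 0.1], [0.9, 0.5], [0.0, 1.0]]
# Stand-in for the conditioned encoder: fixed embeddings per image index.
conditioned = {0: [0.0, 1.0], 1: [1.0, 0.0]}
ranking = rerank(query, gallery, lambda i, t: conditioned[i], k=2)
```

Restricting the conditioned pass to the top-k keeps the per-query cost bounded, since conditioned image embeddings depend on the query and cannot be cached across queries.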

Taxonomy

Methods: Contrastive Language-Image Pre-training · Sparse Evolutionary Training