A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Bulat Khaertdinov; Mirela Popa; Nava Tintarev

arXiv:2511.17255 · cs.CV · November 24, 2025

TL;DR

This paper introduces relevance feedback mechanisms for vision-language models to improve text-to-image retrieval performance without fine-tuning, demonstrating consistent gains and robustness across different models and settings.

Contribution

The paper proposes four relevance feedback strategies, among them generative relevance feedback (GRF) and an attentive feedback summarizer (AFS), to improve the retrieval accuracy and robustness of vision-language models at inference time.

Findings

1. Relevance feedback improves retrieval by 3-5% in MRR@5 for smaller VLMs.

2. The methods are more robust than traditional pseudo-relevance feedback in multi-turn retrieval.

3. Relevance feedback enables interactive, adaptive visual search without additional fine-tuning.
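
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how MRR@k is conventionally computed: for each query, take the reciprocal of the rank of the first relevant item within the top k results (0 if none appears), then average over queries. The function names and the toy data are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank_at_k(
    ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable], k: int = 5
) -> float:
    """Return 1/rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int = 5,
) -> float:
    """Average the per-query reciprocal ranks over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    )
    return total / len(rankings)


# Toy example (hypothetical image ids): the first query finds its relevant
# image at rank 2, the second query finds nothing in the top 5.
toy_rankings = [["a", "b", "c"], ["x", "y", "z"]]
toy_relevants = [{"b"}, {"q"}]
toy_score = mrr_at_k(toy_rankings, toy_relevants, k=5)
```

Because only the first relevant hit counts, the metric rewards moving a correct image toward the top of the list, which is the effect a feedback round aims for.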

Abstract

Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom…
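
The abstract describes pseudo-relevance feedback only as refining query embeddings based on top-ranked results. The sketch below illustrates that general idea with a Rocchio-style interpolation between the query embedding and the centroid of the top-ranked image embeddings. The interpolation rule, the `alpha` and `top_k` parameters, and all function names are assumptions made for illustration; they are not the paper's formulation, and the embeddings here are plain lists standing in for VLM outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_images(query: Vector, image_embs: Sequence[Vector]) -> List[int]:
    """Return image indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    scores = [dot(q, l2_normalize(e)) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


def prf_refine(
    query: Vector,
    image_embs: Sequence[Vector],
    top_k: int = 3,      # hypothetical: number of top results treated as relevant
    alpha: float = 0.7,  # hypothetical: weight kept on the original query
) -> List[float]:
    """Mix the query with the centroid of its top-ranked images (assumed rule)."""
    q = l2_normalize(query)
    top = rank_images(query, image_embs)[:top_k]
    if not top:
        return q
    top_embs = [l2_normalize(image_embs[i]) for i in top]
    centroid = [sum(e[d] for e in top_embs) / len(top_embs) for d in range(len(q))]
    mixed = [alpha * q[d] + (1.0 - alpha) * centroid[d] for d in range(len(q))]
    return l2_normalize(mixed)


# Toy 2-D embeddings standing in for text and image features.
toy_query = [1.0, 0.0]
toy_images = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0]]
refined_query = prf_refine(toy_query, toy_images, top_k=2)
second_round = rank_images(refined_query, toy_images)
```

Since the top-ranked results are assumed relevant without any check, an early retrieval mistake pulls the refined query toward the wrong images. That is a limitation of pseudo-relevance feedback, and the abstract says GRF was proposed to address PRF's limitations.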

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Information Retrieval and Search Behavior