# Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

**Authors:** Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias

arXiv: 2509.00177 · 2025-09-03

## TL;DR

This paper presents a two-step method for category-level text-to-image retrieval: a generative diffusion model turns the text query into a visual query, and a vision model then estimates image-to-image similarity against the database. The approach consistently outperforms retrieval that relies on text queries alone.

## Contribution

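The paper introduces a diffusion-based text-to-visual query transformation and an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across the text and visual query modalities, with the aim of bridging the modality gap between text and image representations.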

## Key findings

- Consistently outperforms retrieval methods that rely solely on text queries
- Consistently improves retrieval accuracy across datasets
- Builds on advances in vision encoders, VLMs, and text-to-image generation models

## Abstract

This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
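The scoring pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are taken as precomputed NumPy arrays (no diffusion model or encoder is called), mean pooling stands in for the learned aggregation network, and a fixed weight `alpha` stands in for the learned score fusion. All function names and the `alpha` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def aggregate_generated(gen_embs):
    """Pool embeddings of several generated images into one query vector.

    Assumption: plain mean pooling replaces the paper's learned
    aggregation network.
    """
    return l2_normalize(l2_normalize(gen_embs).mean(axis=0))


def fused_scores(text_emb, gen_embs, db_vlm, db_vis, alpha=0.5):
    """Score every database image against both query modalities.

    text_emb: (d_t,)   text query embedding from a VLM text encoder
    gen_embs: (k, d_v) vision-encoder embeddings of k generated images
    db_vlm:   (n, d_t) database embeddings from the VLM image encoder
    db_vis:   (n, d_v) database embeddings from the vision encoder
    alpha:    hypothetical fixed weight on the text-to-image score
    """
    text_sim = l2_normalize(db_vlm) @ l2_normalize(text_emb)
    visual_sim = l2_normalize(db_vis) @ aggregate_generated(gen_embs)
    return alpha * text_sim + (1.0 - alpha) * visual_sim


def rank(scores):
    """Return database indices sorted from most to least similar."""
    return np.argsort(-scores)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db_vlm = rng.normal(size=(5, 16))
    db_vis = rng.normal(size=(5, 8))
    text_emb = rng.normal(size=16)
    # Pretend the generated images resemble database image 2.
    gen_embs = db_vis[2] + 0.01 * rng.normal(size=(4, 8))
    print(rank(fused_scores(text_emb, gen_embs, db_vlm, db_vis)))
```

Setting `alpha=1.0` reduces the sketch to plain text-to-image retrieval, while `alpha=0.0` uses only the generated visual query, which makes it easy to compare the two query modalities on the same database.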

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00177/full.md

## Figures

44 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00177/full.md

## References

81 references — full list in the complete paper: https://tomesphere.com/paper/2509.00177/full.md

---
Source: https://tomesphere.com/paper/2509.00177