VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Di Wu; Yixin Wan; Kai-Wei Chang

arXiv:2505.20291 · cs.CV · April 28, 2026

TL;DR

VisRet is a retrieval paradigm for text-to-image retrieval: it projects the text query into the image domain via T2I generation and then retrieves within the image modality, improving performance across four benchmarks.

Contribution

The paper presents VisRet, a new approach that mitigates cross-modal embedding limitations by transforming text queries into the image modality for more accurate retrieval.

Findings

01

Outperforms existing methods on four benchmarks with significant nDCG@30 improvements.

02

Increases downstream question answering accuracy by up to 15.7%.

03

Demonstrates compatibility with various T2I models and LLMs.
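
For readers unfamiliar with the metric in the first finding, nDCG@30 can be sketched as below. This is a minimal illustration of one common formulation (linear gain, log2 discount), not necessarily the exact variant the paper uses. The function names are hypothetical, and the ideal ranking is computed from the supplied list only, which assumes the list contains every relevant item.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 30) -> float:
    """nDCG@k for one query, given relevance labels in retrieved order.

    Assumption: the ideal ordering is a re-sort of the same labels, so any
    relevant item missing from the list is not counted against the ranking.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy example: relevant images retrieved at positions 1 and 3.
score = ndcg_at_k([1, 0, 1], k=30)
```

A perfect ranking scores 1.0, and pushing a relevant image further down the list lowers the score.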

Abstract

Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the…
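
The two-stage idea described in the abstract (generate images from the query, then retrieve by image-to-image similarity) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: `generate` and `embed_image` are hypothetical stand-ins for a T2I model and an image encoder, images are represented directly as small feature vectors, and averaging similarity over the generated images is an assumed aggregation rule that the abstract does not specify.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def visualize_then_retrieve(
    query: str,
    generate: Callable[[str], List[Vector]],   # hypothetical T2I model
    embed_image: Callable[[Vector], Vector],   # hypothetical image encoder
    corpus: Dict[str, Vector],                 # image id -> image embedding
    top_k: int = 3,
) -> List[str]:
    # Stage 1: project the text query into the image modality.
    query_vecs = [embed_image(img) for img in generate(query)]

    # Stage 2: retrieve within the image modality.
    # Assumption: score each candidate by its mean similarity to the
    # generated images.
    scores = {
        image_id: sum(cosine(q, vec) for q in query_vecs) / len(query_vecs)
        for image_id, vec in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: three corpus "images" and two "generated" visualizations.
corpus = {
    "bird_side": [1.0, 0.0, 0.2],
    "bird_front": [0.0, 1.0, 0.2],
    "cat": [0.1, 0.1, 1.0],
}
toy_generate = lambda q: [[0.9, 0.1, 0.2], [0.8, 0.0, 0.3]]
toy_embed = lambda img: list(img)

ranking = visualize_then_retrieve(
    "side view of a bird", toy_generate, toy_embed, corpus
)
```

In a real system the stand-ins would be replaced by an actual T2I generator and an image encoder; the point of the sketch is only that the final similarity is computed between images rather than between text and image.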

Code & Models

Datasets

uclanlp/Visual-RAG-ME
dataset · 55 dl
