VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Di Wu, Yixin Wan, Kai-Wei Chang

TL;DR
VisRet is a paradigm for text-to-image retrieval: it projects the textual query into the image modality via text-to-image (T2I) generation and then retrieves within that modality, improving nDCG@30 on four benchmarks.
Contribution
The paper presents VisRet, a new approach that mitigates cross-modal embedding limitations by transforming text queries into the image modality for more accurate retrieval.
Findings
Outperforms cross-modal similarity matching and text-to-text baselines in nDCG@30 on four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and Visual-RAG-ME).
Increases downstream question answering accuracy by up to 15.7%.
Demonstrates compatibility with various T2I models and LLMs.
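For readers unfamiliar with the metric named in the findings, the following is a minimal sketch of nDCG@k with linear gain. The function names are illustrative, and the ideal ranking is computed from the judged list itself, which assumes that list contains every relevant item. This page does not say which nDCG variant the paper uses, so treat this as one common formulation rather than the authors' evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 30) -> float:
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    Assumption: `relevances` lists the relevance of every retrieved item in
    ranked order and includes all relevant items, so sorting it gives the ideal.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0])        # relevant item ranked first
demoted = ndcg_at_k([0, 1], k=2)      # relevant item ranked second
```

A ranking that places the only relevant image second scores lower than one that places it first, which is the behavior the metric is meant to capture.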
Abstract
Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the…
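The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `visualize_then_retrieve`, `toy_generate_image`, and `toy_embed_image` are hypothetical names, the T2I generator and image encoder are passed in as plain callables, and the toy stand-ins use two-dimensional vectors in place of real images and embeddings.

```python
import math
from typing import Any, Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def visualize_then_retrieve(
    query: str,
    generate_image: Callable[[str], Any],   # hypothetical T2I generator
    embed_image: Callable[[Any], Vector],   # hypothetical image encoder
    corpus: Dict[str, Vector],              # image id -> precomputed image embedding
    k: int = 30,
) -> List[str]:
    """Rank corpus images by similarity to an image generated from the query."""
    query_image = generate_image(query)      # stage 1: text -> image
    query_vec = embed_image(query_image)     # stage 2: embed in image space
    ranked = sorted(corpus, key=lambda i: cosine(query_vec, corpus[i]), reverse=True)
    return ranked[:k]


# Toy stand-ins (assumptions): an "image" is just [front-view-ness, side-view-ness].
def toy_generate_image(query: str) -> List[float]:
    return [1.0, 0.1] if "front" in query else [0.1, 1.0]


def toy_embed_image(image: List[float]) -> List[float]:
    return image


corpus = {
    "bird_front.jpg": [0.9, 0.2],
    "bird_side.jpg": [0.2, 0.9],
    "bird_blurry.jpg": [0.5, 0.5],
}
top = visualize_then_retrieve(
    "a bird seen from the front", toy_generate_image, toy_embed_image, corpus, k=2
)
```

Passing the generator and encoder as arguments keeps the sketch independent of any particular T2I model or retriever, in line with the finding that the approach works with various T2I models.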
