Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries

Yin Wu; Quanyu Long; Jing Li; Jianfei Yu; Wenya Wang

arXiv:2502.16636·cs.CL·August 18, 2025

TL;DR

Visual-RAG introduces a benchmark for evaluating how effectively multimodal language models utilize retrieved images for visually grounded, knowledge-intensive question answering, revealing current limitations and areas for improvement.

Contribution

We present Visual-RAG, a novel benchmark that isolates and measures the contribution of retrieved images in multimodal RAG systems for visual knowledge questions.

Findings

01

Images significantly aid answer generation in Visual-RAG.

02

State-of-the-art models struggle to effectively utilize visual evidence.

03

Current models need better visual retrieval and grounding mechanisms.

Abstract

Retrieval-augmented generation (RAG) is a paradigm that augments large language models (LLMs) with external knowledge to tackle knowledge-intensive question answering. While several benchmarks evaluate Multimodal LLMs (MLLMs) under Multimodal RAG settings, they predominantly retrieve from textual corpora and do not explicitly assess how models exploit visual evidence during generation. Consequently, there is still no benchmark that isolates and measures the contribution of retrieved images in RAG. We introduce Visual-RAG, a question-answering benchmark that targets visually grounded, knowledge-intensive questions. Unlike prior work, Visual-RAG requires text-to-image retrieval and the integration of retrieved clue images to extract visual evidence for answer generation. With Visual-RAG, we evaluate 5 open-source and 3 proprietary MLLMs, showing that images provide strong evidence in…
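The retrieve-then-generate flow the abstract describes can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function names (`retrieve_top_k`, `build_mllm_input`), the two-dimensional toy embeddings, and the choice of cosine similarity are all assumptions made for illustration. A real system would take query and image embeddings from a text-image encoder and pass the selected images to an MLLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, image_index, k=2):
    """Rank images by similarity to the text query and return the top-k IDs.

    image_index maps a hypothetical image ID to a precomputed embedding.
    """
    scored = [(cosine(query_vec, vec), image_id)
              for image_id, vec in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:k]]


def build_mllm_input(question, image_ids):
    """Bundle the question with the retrieved clue images for generation."""
    return {"question": question, "images": list(image_ids)}


# Toy embeddings (assumed values, not taken from the paper).
toy_index = {
    "img_a": [0.9, 0.1],
    "img_b": [0.0, 1.0],
    "img_c": [0.7, 0.7],
}
top_images = retrieve_top_k([1.0, 0.0], toy_index, k=2)
mllm_input = build_mllm_input("a visually grounded question", top_images)
```

In this sketch the answer can only draw on the images that survive the top-k cut, which is why retrieval quality and the model's use of visual evidence are worth measuring separately.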

Code & Models

Repositories

LuciusLan/Visual-RAG
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection