Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts

Zhenghao Liu; Xingsheng Zhu; Tianshuo Zhou; Xinyi Zhang; Xiaoyuan Yi; Yukun Yan; Ge Yu; Maosong Sun

arXiv:2502.17297·cs.AI·August 8, 2025

TL;DR

This paper introduces M$^2$RAG, a benchmark for evaluating multi-modal retrieval-augmented generation models across four tasks, and proposes MM-RAIT, an instruction tuning method that significantly improves model performance in multi-modal contexts.

Contribution

The paper presents M$^2$RAG, a new benchmark for multi-modal RAG tasks, and introduces MM-RAIT, an instruction tuning technique that enhances multi-modal model responses.

Findings

01

MM-RAIT improves response quality significantly.

02

MM-RAIT outperforms MiniCPM-V 2.6 and Qwen2-VL by 34% and 33%, respectively.

03

Demonstrates effectiveness across four multi-modal tasks.

Abstract

With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of MLLMs in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context…
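The open-domain setting described in the abstract (retrieve query-relevant documents from a multi-modal collection, then pass them to the model as context) can be illustrated with a minimal sketch. This is not the benchmark's code: the `Doc` structure, the toy three-dimensional embeddings, the `<image:...>` placeholder syntax, and the file paths are all hypothetical, and a real system would obtain embeddings from a multi-modal encoder rather than hard-coding them.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Doc:
    """A hypothetical multi-modal document: text, an image, or both."""
    doc_id: str
    embedding: Sequence[float]          # assumed to come from some multi-modal encoder
    text: Optional[str] = None
    image_path: Optional[str] = None    # illustrative path, not a real file


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb: Sequence[float], docs: List[Doc], k: int = 2) -> List[Doc]:
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query_emb, d.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: List[Doc]) -> str:
    """Concatenate retrieved documents and the question into one context string."""
    lines = ["Use the retrieved documents to answer the question."]
    for i, doc in enumerate(retrieved, start=1):
        parts = []
        if doc.image_path:
            parts.append(f"<image:{doc.image_path}>")  # placeholder for an image input
        if doc.text:
            parts.append(doc.text)
        lines.append(f"[{i}] " + " ".join(parts))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy collection with hand-written embeddings (assumption for illustration only).
COLLECTION = [
    Doc("d1", [1.0, 0.0, 0.0], text="A red panda resting on a branch.", image_path="img/d1.jpg"),
    Doc("d2", [0.0, 1.0, 0.0], text="The Eiffel Tower at night."),
    Doc("d3", [0.9, 0.1, 0.0], image_path="img/d3.jpg"),
]

top = retrieve([1.0, 0.0, 0.0], COLLECTION, k=2)
prompt = build_prompt("What animal is shown?", top)
print(prompt)
```

In this sketch the prompt string stands in for the contextual input an MLLM would receive; how the benchmark actually formats retrieved images and text for each of the four tasks is not specified in the text above.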

Code & Models

Datasets

whalezzz/M2RAG
dataset · 64 dl

Taxonomy

Methods

Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection