Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts
Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun

TL;DR
This paper introduces M$^2$RAG, a benchmark for evaluating multi-modal retrieval-augmented generation models across four tasks, and proposes MM-RAIT, an instruction tuning method that significantly improves model performance in multi-modal contexts.
Contribution
The paper presents M$^2$RAG, a new benchmark for multi-modal RAG tasks, and introduces MM-RAIT, an instruction tuning technique that enhances multi-modal model responses.
Findings
MM-RAIT improves response quality significantly.
Outperforms MiniCPM-V 2.6 and Qwen2-VL by 34% and 33%, respectively.
Demonstrates effectiveness across four multi-modal tasks.
Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of MLLMs in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context…
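The open-domain setup described in the abstract (retrieve query-relevant items from a mixed image/text collection, then pass them to the model as context) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy three-dimensional embeddings, and the prompt layout are hypothetical and are not taken from the paper or its benchmark code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, collection, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(collection, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]

def build_context(query, docs):
    """Concatenate retrieved items (tagged by modality) ahead of the query."""
    lines = [f"[{d['modality']}] {d['content']}" for d in docs]
    return "\n".join(lines + [f"Query: {query}"])

# Toy collection: hand-written vectors stand in for a real multi-modal encoder.
collection = [
    {"modality": "text", "content": "The Eiffel Tower is in Paris.", "vec": [0.9, 0.1, 0.0]},
    {"modality": "image", "content": "eiffel_tower.jpg", "vec": [0.8, 0.2, 0.1]},
    {"modality": "text", "content": "Bananas are rich in potassium.", "vec": [0.0, 0.1, 0.9]},
]

top = retrieve([1.0, 0.0, 0.0], collection, k=2)
prompt = build_context("Where is the Eiffel Tower?", top)
print(prompt)
```

In a real system the image entry would be passed to the model as pixels rather than as a file name, and the embeddings would come from a shared image-text encoder; the sketch only shows the retrieve-then-condition control flow.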
Taxonomy
Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection
