Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG

Xihang Wang; Zihan Wang; Chengkai Huang; Cao Liu; Ke Zeng; Quan Z. Sheng; Lina Yao

arXiv:2604.27600 · cs.IR · May 1, 2026

TL;DR

This paper introduces FES-RAG, a fine-grained evidence selection framework for multimodal retrieval-augmented generation, which improves factual accuracy and reduces noise by selecting relevant fragments instead of entire documents.

Contribution

The paper proposes a fragment-level evidence selection method guided by Fragment Information Gain, which improves retrieval precision and the downstream performance of multimodal large language models.

Findings

01. FES-RAG outperforms state-of-the-art methods on the M2RAG benchmark.

02. Achieves up to a 27% relative improvement in CIDEr score.

03. Reduces context length while improving factual accuracy and coherence.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) is widely adopted to ground Multimodal Large Language Models (MLLMs) in external evidence and reduce hallucinations. Despite its success, most existing MRAG frameworks treat retrieved evidence as indivisible documents, implicitly assuming that all content within a document is equally informative. In practice, however, only a small fraction of a document may be relevant to a given query, while the remaining content introduces substantial noise that can degrade performance. We address this fundamental limitation by reframing MRAG as a fine-grained evidence selection problem. We propose Fragment-level Evidence Selection for RAG (FES-RAG), a framework that selects atomic multimodal fragments rather than entire documents as grounding evidence. FES-RAG decomposes retrieved multimodal documents into sentence-level textual fragments…
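The general idea of selecting fragments rather than whole documents can be illustrated with a minimal sketch. It is not the authors' implementation: it handles text only, the names `Fragment`, `split_into_fragments`, `lexical_overlap`, and `select_fragments` are hypothetical, and the scorer is a simple token-overlap placeholder standing in for Fragment Information Gain, which this listing does not define.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fragment:
    doc_id: int  # index of the retrieved document the fragment came from
    text: str    # one sentence of that document


def split_into_fragments(docs: List[str]) -> List[Fragment]:
    """Split each retrieved document into sentence-level text fragments."""
    fragments = []
    for doc_id, doc in enumerate(docs):
        # Naive sentence split: break on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sentence:
                fragments.append(Fragment(doc_id, sentence))
    return fragments


def lexical_overlap(query: str, fragment: Fragment) -> float:
    """Placeholder scorer (an assumption, NOT Fragment Information Gain):
    the fraction of query tokens that also appear in the fragment."""
    query_tokens = set(re.findall(r"\w+", query.lower()))
    fragment_tokens = set(re.findall(r"\w+", fragment.text.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & fragment_tokens) / len(query_tokens)


def select_fragments(
    query: str,
    docs: List[str],
    scorer: Callable[[str, Fragment], float] = lexical_overlap,
    top_k: int = 2,
) -> List[Fragment]:
    """Keep the top_k highest-scoring fragments across all documents."""
    fragments = split_into_fragments(docs)
    ranked = sorted(fragments, key=lambda f: scorer(query, f), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    docs = [
        "The Eiffel Tower is in Paris. It was completed in 1889. Tickets are sold online.",
        "Bananas are yellow. Paris hosts the Eiffel Tower.",
    ]
    for fragment in select_fragments("Where is the Eiffel Tower?", docs):
        print(fragment.doc_id, fragment.text)
```

Passing the scorer as a parameter keeps the decomposition and selection steps independent of how relevance is measured, so a model-based score could replace the placeholder without changing the rest of the sketch.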
