Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG
Xihang Wang, Zihan Wang, Chengkai Huang, Cao Liu, Ke Zeng, Quan Z. Sheng, Lina Yao

TL;DR
This paper introduces FES-RAG, a fine-grained evidence selection framework for multimodal retrieval-augmented generation, which improves factual accuracy and reduces noise by selecting relevant fragments instead of entire documents.
Contribution
It proposes a novel fragment-level evidence selection method guided by Fragment Information Gain, enhancing retrieval precision and model performance in multimodal large language models.
Findings
FES-RAG outperforms state-of-the-art methods on the M2RAG benchmark.
Achieves up to 27% relative improvement in CIDEr score.
Reduces context length while improving factual accuracy and coherence.
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) is widely adopted to augment Multimodal Large Language Models (MLLMs) with external evidence and reduce hallucinations. Despite its success, most existing MRAG frameworks treat retrieved evidence as indivisible documents, implicitly assuming that all content within a document is equally informative. In practice, however, sometimes only a small fraction of a document is relevant to a given query, while the remaining content introduces substantial noise that may lead to performance degradation. We address this fundamental limitation by reframing MRAG as a fine-grained evidence selection problem. We propose Fragment-level Evidence Selection for RAG (FES-RAG), a framework that selects atomic multimodal fragments rather than entire documents as grounding evidence. FES-RAG decomposes retrieved multimodal documents into sentence-level textual fragments…
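To make the idea of selecting fragments rather than whole documents concrete, here is a minimal, text-only sketch in Python. It is not the authors' implementation: the names `split_into_fragments`, `overlap_score`, and `select_fragments` are hypothetical, the punctuation-based sentence splitter is a simplification, and the token-overlap score is only a stand-in for the paper's Fragment Information Gain, which this summary does not define. Image fragments are not handled at all.

```python
import re
from typing import Callable, List


def split_into_fragments(document: str) -> List[str]:
    """Split a document into sentence-level fragments (naive punctuation split)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def overlap_score(query: str, fragment: str) -> float:
    """Placeholder relevance score: fraction of query tokens found in the fragment.

    ASSUMPTION: this stands in for Fragment Information Gain, whose
    definition is not given in the summary above.
    """
    query_tokens = set(re.findall(r"\w+", query.lower()))
    fragment_tokens = set(re.findall(r"\w+", fragment.lower()))
    if not query_tokens:
        return 0.0
    return len(query_tokens & fragment_tokens) / len(query_tokens)


def select_fragments(
    query: str,
    documents: List[str],
    k: int = 2,
    score: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Return up to k fragments with the highest positive score across all documents."""
    scored = [
        (score(query, fragment), fragment)
        for document in documents
        for fragment in split_into_fragments(document)
    ]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fragment for s, fragment in scored[:k] if s > 0]


if __name__ == "__main__":
    docs = [
        "The Eiffel Tower is located in Paris. It was completed in 1889. "
        "Many tourists visit each year.",
        "Bananas are rich in potassium.",
    ]
    print(select_fragments("Where is the Eiffel Tower located?", docs))
```

In this toy run only the first sentence of the first document shares tokens with the query, so the remaining sentences are dropped from the context even though they were retrieved alongside it. Any scorer with the same `(query, fragment) -> float` signature could be passed as `score`.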
