MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
Xihang Wang, Zihan Wang, Chengkai Huang, Quan Z. Sheng, Lina Yao

TL;DR
This paper introduces MEG, a semantic-aware grounding metric, and MEG-RAG, a framework built on it that improves evidence selection in multimodal retrieval-augmented generation, with the aim of reducing hallucinations and improving factual accuracy.
Contribution
It proposes MEG, a novel semantic grounding metric, and MEG-RAG, a training framework that aligns retrieved evidence with semantic anchors for better multimodal answer quality.
Findings
MEG-RAG outperforms strong baselines on the M$^2$RAG benchmark.
It demonstrates improved accuracy and multimodal consistency in generated outputs.
The approach generalizes well across different teacher models.
Abstract
Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to…
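The anchoring idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the MEG formula, so the smoothed IDF definition, the `top_k` cutoff, and the `grounding_gain` score (mean log-probability gain on anchor tokens when evidence is supplied) are all assumptions made for illustration, and the function names, toy corpus, and log-probabilities are hypothetical.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Smoothed IDF per token over a list of tokenized documents (assumed form)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def select_anchors(answer_tokens, idf, top_k=2):
    """Return positions of the top_k highest-IDF tokens in the answer."""
    default = max(idf.values())  # assumption: unseen tokens count as rare
    ranked = sorted(range(len(answer_tokens)),
                    key=lambda i: idf.get(answer_tokens[i], default),
                    reverse=True)
    return sorted(ranked[:top_k])

def grounding_gain(logp_with, logp_without, anchors):
    """Hypothetical score: mean log-prob gain on anchor tokens given evidence."""
    return sum(logp_with[i] - logp_without[i] for i in anchors) / len(anchors)

# Toy corpus used only to estimate IDF (made-up data).
corpus = [
    ["the", "tower", "is", "in", "paris"],
    ["the", "bridge", "is", "in", "london"],
    ["the", "museum", "is", "in", "paris"],
]
idf = idf_table(corpus)

answer = ["the", "tower", "is", "in", "paris"]
anchors = select_anchors(answer, idf, top_k=2)

# Made-up per-token log-probabilities of the answer, with and without
# the retrieved evidence in the model's context.
logp_with = [-0.1, -0.5, -0.1, -0.1, -0.25]
logp_without = [-0.1, -2.5, -0.1, -0.1, -1.25]
gain = grounding_gain(logp_with, logp_without, anchors)
print(anchors, gain)
```

In this toy setup the function words share the lowest IDF, so the anchors fall on the content-bearing tokens, and the score reflects only how much the evidence helps on those positions rather than averaging over every token.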
