Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo; Zongye Hu; Xiao Wang; Zhiyuan Yu; Haofeng Zhang; Ziyi Huang

arXiv:2605.13277 · cs.CL · May 14, 2026


TL;DR

This paper introduces an information-theoretic approach to select visual evidence for multimodal retrieval-augmented generation, improving utility and efficiency over existing methods.

Contribution

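The paper defines evidence utility as the information gain that a piece of visual evidence induces on the model's output distribution, introduces a latent notion of evidence helpfulness to make ranking by that utility tractable, and proposes a training-free, surrogate-accelerated framework that estimates utility with lightweight multimodal models, improving performance while reducing computational cost.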

Findings

01 Outperforms state-of-the-art RAG baselines on MRAG-Bench and Visual-RAG.

02 Achieves substantial reductions in computational cost.

03 Demonstrates the effectiveness of information-gain-based evidence ranking.

Abstract

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on…
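To make the idea of ranking by information gain concrete, here is a minimal sketch. It is not the authors' implementation: it assumes information gain is measured as the reduction in Shannon entropy of a discrete answer distribution when one piece of evidence is added, and the distributions and the names `img_a`, `img_b`, `img_c` are hypothetical values chosen for illustration. The paper's actual estimator, its latent helpfulness variable, and its surrogate models are not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def information_gain(prior, posterior):
    """Entropy reduction from the evidence-free distribution to the
    evidence-conditioned one (an assumed definition for this sketch)."""
    return entropy(prior) - entropy(posterior)


def rank_by_information_gain(prior, posteriors):
    """Return evidence ids sorted from highest to lowest information gain."""
    gains = {eid: information_gain(prior, post) for eid, post in posteriors.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical model output distributions over three candidate answers.
prior = [1 / 3, 1 / 3, 1 / 3]          # no visual evidence: maximally uncertain
posteriors = {
    "img_a": [0.90, 0.05, 0.05],       # sharply concentrates the distribution
    "img_b": [0.40, 0.30, 0.30],       # mildly informative
    "img_c": [0.34, 0.33, 0.33],       # almost no change from the prior
}

ranking = rank_by_information_gain(prior, posteriors)
print(ranking)
```

Under this assumed definition, an image that looks semantically similar to the query but leaves the answer distribution nearly unchanged (like `img_c`) ranks last, which is the distinction between relevance and utility that the abstract draws.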
