From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

Guanhua Chen; Chuyue Huang; Yutong Yao; Shudong Liu; Xueqing Song; Lidia S. Chao; Derek F. Wong

arXiv:2605.15019 · cs.CL · May 15, 2026

TL;DR

This paper introduces a multi-granularity evidence retrieval framework for multimodal RAG systems, improving fine-grained, verifiable evidence retrieval using element-level annotations and alignment.

Contribution

The paper presents GranuRAG, a three-stage framework that treats visual elements as first-class retrieval units, improving the transparency and accuracy of multimodal evidence retrieval.

Findings

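01 GranuRAG achieves up to a 29.2% improvement over baseline methods.

02 The GranuVistaVQA benchmark uses element-level annotations across multiple viewpoints to capture the partial-observation challenge, where an individual image contains only a subset of the entities.

03 Grounding retrieval at the element level, rather than relying on implicit attention, enables transparent error diagnosis.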

Abstract

Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments…
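
The idea of treating visual elements as first-class retrieval units can be illustrated with a small Python sketch. This is not the authors' implementation: the `Evidence` record, the hand-written three-dimensional vectors, the landmark labels, and the helper names (`cosine`, `retrieve`, `answer_with_attribution`) are all hypothetical. Detection is replaced by a hand-built index, cross-modal alignment by plain cosine similarity, and attribution-constrained generation by a string template that cites the evidence it used.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Evidence:
    """One retrievable unit: a whole scene or a single element inside it."""
    image_id: str
    element_id: Optional[str]  # None marks scene-level (coarse) evidence
    label: str
    vector: Tuple[float, ...]  # toy embedding; a real system would use an encoder


def cosine(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Tuple[float, ...], index: List[Evidence], k: int = 1) -> List[Evidence]:
    """Rank scene-level and element-level units together and keep the top k."""
    ranked = sorted(index, key=lambda e: cosine(query, e.vector), reverse=True)
    return ranked[:k]


def answer_with_attribution(retrieved: List[Evidence]) -> List[str]:
    """Stand-in for attributed generation: every statement cites its source unit."""
    return [f"{e.label} [{e.image_id}#{e.element_id or 'scene'}]" for e in retrieved]


# Hypothetical index mixing one scene-level entry with two element-level entries.
INDEX = [
    Evidence("img_01", None, "temple facade, wide shot", (0.9, 0.1, 0.1)),
    Evidence("img_01", "e1", "stone lion statue", (0.1, 0.9, 0.2)),
    Evidence("img_02", "e2", "inscribed plaque", (0.1, 0.2, 0.9)),
]

# Hypothetical embedding of a fine-grained query about the statue.
QUERY = (0.0, 1.0, 0.1)

top = retrieve(QUERY, INDEX, k=1)
cited = answer_with_attribution(top)
```

Because each retrieved unit carries an `image_id` and an `element_id`, a wrong answer in this toy setup can be traced to the specific element that was retrieved, which is the kind of error diagnosis the abstract attributes to element-level grounding.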
