From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG
Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong

TL;DR
This paper introduces a multi-granularity evidence retrieval framework for multimodal RAG systems, improving fine-grained, verifiable evidence retrieval using element-level annotations and alignment.
Contribution
The paper presents GranuRAG, a three-stage framework that treats visual elements as first-class retrieval units, making multimodal evidence retrieval more transparent and accurate.
Findings
GranuRAG achieves up to 29.2% improvement over baseline methods.
The benchmark GranuVistaVQA captures partial observation challenges with element-level annotations.
Grounding retrieval at the element level enables better error diagnosis.
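To make the error-diagnosis finding concrete, here is a minimal sketch of how element-level evidence could separate failure causes. It is an illustration only, not the paper's procedure: the function name `diagnose`, the failure labels, and the idea of comparing required, retrieved, and cited element IDs are all assumptions introduced for this example.

```python
def diagnose(required, retrieved, cited, answer_correct):
    """Classify a failure using element IDs (hypothetical scheme, not from the paper).

    required: element IDs needed to answer the query (e.g. from annotations)
    retrieved: element IDs returned by the retriever
    cited: element IDs the generator attributed its answer to
    answer_correct: whether the final answer matched the reference
    """
    required, retrieved, cited = set(required), set(retrieved), set(cited)

    # Evidence that never reached the generator points to the retriever.
    missing = required - retrieved
    if missing:
        return "retrieval_miss", missing

    # Evidence was retrieved but the answer does not attribute to it.
    uncited = required - cited
    if uncited:
        return "attribution_gap", uncited

    # All required evidence was retrieved and cited, yet the answer is wrong.
    if not answer_correct:
        return "generation_error", set()

    return "ok", set()
```

With image-level retrieval the first two cases are hard to tell apart, because a retrieved image may or may not show the entity the query asks about. This is the partial observation situation the benchmark is described as capturing.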
Abstract
Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments…
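The retrieval and attribution stages described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the GranuRAG implementation: the `Element` record, the cosine scoring, the `alpha` weighting between element-level and image-level similarity, and the prompt wording are all hypothetical, and the embeddings are assumed to come from some upstream detector and encoder that is not shown.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Tuple


@dataclass
class Element:
    """A detected visual element (hypothetical record layout)."""
    element_id: str
    image_id: str
    label: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_vec: List[float],
    elements: List[Element],
    image_vecs: Dict[str, List[float]],
    alpha: float = 0.5,
    top_k: int = 2,
) -> List[Tuple[float, Element]]:
    """Rank elements by a blend of element-level and image-level similarity.

    alpha is an assumed mixing weight: 1.0 uses only the element score,
    0.0 uses only the score of the image that contains the element.
    """
    scored = []
    for el in elements:
        element_score = cosine(query_vec, el.embedding)
        image_score = cosine(query_vec, image_vecs[el.image_id])
        scored.append((alpha * element_score + (1 - alpha) * image_score, el))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, evidence: List[Tuple[float, Element]]) -> str:
    """Build a prompt that lists evidence by element ID so answers can cite it."""
    lines = [
        f"[{el.element_id}] {el.label} (image {el.image_id})"
        for _, el in evidence
    ]
    header = "Answer using only the elements listed below and cite their IDs."
    return header + "\n" + "\n".join(lines) + "\nQuestion: " + question
```

Because each retrieved item carries an element ID and its source image, a generated answer can be checked against the specific elements it cites rather than against a whole image.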
