MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
Sijun Dai, Qiang Huang, Xiaoxing You, Jun Yu

TL;DR
MG$^2$-RAG introduces a hierarchical multimodal knowledge graph that enhances cross-modal reasoning and retrieval in large language models, achieving state-of-the-art results with reduced computational costs.
Contribution
It proposes a lightweight multi-granularity graph framework combining textual and visual information for improved multimodal reasoning and retrieval.
Findings
Achieves state-of-the-art performance on four multimodal tasks.
Speeds up graph construction by 43.3× on average.
Reduces cost by 23.9× relative to existing graph-based methods.
Abstract
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG$^2$-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG$^2$-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we…
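The abstract's idea of fusing textual entities and visual regions into unified multimodal nodes can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `VisualRegion` and `MultimodalNode`, the `fuse` helper, the bounding-box format, and matching on exact entity names are all assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VisualRegion:
    """Hypothetical image region grounded to an entity."""
    image_id: str
    box: tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) pixel format


@dataclass
class MultimodalNode:
    """Hypothetical fused node: one entity with its text and visual evidence."""
    entity: str
    text_spans: list[str] = field(default_factory=list)
    regions: list[VisualRegion] = field(default_factory=list)


def fuse(text_mentions, grounded_regions):
    """Group text mentions and grounded regions by entity name.

    Assumption: both inputs are (entity_name, evidence) pairs and entities
    are matched by exact string equality; a real system would likely need
    entity resolution.
    """
    nodes: dict[str, MultimodalNode] = {}
    for entity, span in text_mentions:
        nodes.setdefault(entity, MultimodalNode(entity)).text_spans.append(span)
    for entity, region in grounded_regions:
        nodes.setdefault(entity, MultimodalNode(entity)).regions.append(region)
    return nodes


# Illustrative usage with made-up inputs.
example_nodes = fuse(
    [("Eiffel Tower", "The Eiffel Tower opened in 1889.")],
    [("Eiffel Tower", VisualRegion("img_001", (40, 10, 220, 480)))],
)
```

Keeping the original text span and the image region side by side on one node is one way to read "preserve atomic evidence": a retriever can return the node and still point back to the exact sentence or region that supports it.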
