VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang; Shihang Wang; Yu Zeng; Qiang Zhang; Fanrui Zhang; Zhuoning Guo; Bosi Zhang; Wenxuan Huang; Lin Chen; Zehui Chen; Pengjun Xie; Ruixue Ding

arXiv:2602.12735 · cs.CV · April 22, 2026

TL;DR

VimRAG is a multimodal retrieval-augmented reasoning framework that structures agent memory as a dynamic graph, enabling efficient handling of long multimodal contexts and reporting state-of-the-art results on multimodal RAG benchmarks.

Contribution

The paper presents a novel graph-structured memory and a graph-guided policy for multimodal RAG, improving reasoning over long, complex visual and textual data.

Findings

1. VimRAG outperforms existing methods on multimodal RAG benchmarks.
2. The graph-modulated memory effectively prioritizes pivotal evidence.
3. The approach enhances reasoning with long, token-heavy visual data.

Abstract

Effectively retrieving, reasoning over, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-Augmented Generation (RAG) methods rely on linear interaction histories, which struggle with long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal retrieval-augmented reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, in which the significance of memory nodes is evaluated via their topological position, allowing the model to…
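The idea of a graph-structured memory whose nodes are weighted by topological position can be illustrated with a small sketch. This is not the authors' implementation: the class and function names (`MemoryGraph`, `significance`, `allocate_visual_tokens`), the node kinds, the descendant-count score, and the proportional token budget are all assumptions chosen for illustration, since the abstract does not specify how significance is computed or used.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    node_id: str
    kind: str  # hypothetical labels: "state", "text", "image", "video"
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


class MemoryGraph:
    """Toy directed acyclic graph of agent states and retrieved evidence."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, kind, parents=()):
        # Parents must already exist, so every edge points from an older
        # node to a newer one and the graph cannot contain a cycle.
        if node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node_id}")
        for parent in parents:
            if parent not in self.nodes:
                raise KeyError(f"unknown parent: {parent}")
        node = MemoryNode(node_id, kind, parents=list(parents))
        self.nodes[node_id] = node
        for parent in parents:
            self.nodes[parent].children.append(node_id)
        return node

    def descendants(self, node_id):
        # Iterative depth-first traversal over outgoing edges.
        seen = set()
        stack = list(self.nodes[node_id].children)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current].children)
        return seen

    def significance(self, node_id):
        # ASSUMPTION: topological importance is approximated as
        # 1 + number of descendants, i.e. how much later reasoning
        # was built on top of this node.
        return 1 + len(self.descendants(node_id))


def allocate_visual_tokens(graph, total_budget):
    """Split a token budget across visual nodes in proportion to significance."""
    visual = [n for n in graph.nodes.values() if n.kind in ("image", "video")]
    if not visual:
        return {}
    scores = {n.node_id: graph.significance(n.node_id) for n in visual}
    total = sum(scores.values())
    # Integer division keeps the allocation within the budget.
    return {nid: total_budget * s // total for nid, s in scores.items()}


# Usage: one image leads to two further reasoning steps, the other to none.
g = MemoryGraph()
g.add_node("q", "state")
g.add_node("img_a", "image", parents=["q"])
g.add_node("img_b", "image", parents=["q"])
g.add_node("s1", "state", parents=["img_a"])
g.add_node("s2", "state", parents=["s1"])
budget = allocate_visual_tokens(g, 1000)
```

In this toy setup the image that later reasoning steps depend on receives a larger share of the budget than the dead-end image, which mirrors the stated goal of prioritizing pivotal evidence over information-sparse visual tokens. Any real system would likely use a richer scoring rule than a descendant count.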

Code & Models

Repositories

Alibaba-NLP/VRAG
github

Datasets

Alibaba-NLP/xvbench
dataset · 78 dl