ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Qiuchen Wang; Ruixue Ding; Zehui Chen; Weiqi Wu; Shihang Wang; Pengjun Xie; Feng Zhao

arXiv:2502.18017·cs.CV·June 4, 2025



TL;DR

ViDoRAG is a multi-agent retrieval-augmented generation (RAG) framework for complex reasoning over visually rich documents. It combines multi-modal retrieval with iterative reasoning and outperforms existing methods by over 10% on the ViDoSeek benchmark.

Contribution

The paper introduces ViDoRAG, a novel multi-agent RAG framework with a GMM-based hybrid retrieval strategy and iterative reasoning workflow for visually rich documents.
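This page does not describe how the GMM-based hybrid retrieval works, so the following is only a rough sketch of one plausible reading: fit a two-component Gaussian mixture to each retriever's similarity scores, keep the candidates that fall in the higher-mean component, and merge the results from the textual and visual retrievers. The function names (`select_by_gmm`, `hybrid_retrieve`), the 0.5 posterior threshold, and the union-style merge are all assumptions made for illustration, not the authors' implementation.

```python
import math


def _pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def select_by_gmm(scores, iters=50, var_floor=1e-6):
    """Return indices of scores assigned to the higher-mean mixture component.

    Hypothetical helper: fits a two-component 1-D Gaussian mixture with EM
    and keeps scores whose posterior for the high-mean component exceeds 0.5.
    """
    n = len(scores)
    lo, hi = (min(scores), max(scores)) if n else (0.0, 0.0)
    if n < 2 or hi - lo < 1e-9:
        return list(range(n))  # nothing to separate; keep every candidate

    means = [lo, hi]  # start the components at the two extremes
    variances = [max((hi - lo) ** 2 / 4.0, var_floor)] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each score
        resp = []
        for s in scores:
            p = [weights[k] * _pdf(s, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, var_floor)
            weights[k] = nk / n

    high = 0 if means[0] > means[1] else 1
    return [i for i, r in enumerate(resp) if r[high] > 0.5]


def hybrid_retrieve(text_scores, visual_scores):
    """Merge the GMM-selected candidates of two retrievers (assumed union merge).

    Each argument maps a candidate id to that retriever's similarity score.
    """
    merged = []
    for scores in (text_scores, visual_scores):
        ids = list(scores)
        for idx in select_by_gmm([scores[i] for i in ids]):
            if ids[idx] not in merged:
                merged.append(ids[idx])
    return merged
```

Compared with a fixed top-k, a cutoff derived from the score distribution lets the number of retrieved pages vary per query: a query with a few clearly relevant pages returns few candidates, while an ambiguous query returns more.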

Findings

01. Outperforms existing methods by over 10% on the ViDoSeek benchmark.
02. Effectively integrates textual and visual features for complex reasoning.
03. Demonstrates strong generalization and reasoning capabilities.

Abstract

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning…


Code & Models

Repositories

Alibaba-NLP/ViDoRAG (Official)

Videos

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Weight Decay · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Linear Warmup With Linear Decay