ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao

TL;DR
ViDoRAG is a multi-agent retrieval-augmented generation framework designed for complex reasoning over visually rich documents, addressing limitations of existing methods by integrating multi-modal retrieval and iterative reasoning, and demonstrating significant performance improvements.
Contribution
The paper introduces ViDoRAG, a novel multi-agent RAG framework with a GMM-based hybrid retrieval strategy and iterative reasoning workflow for visually rich documents.
Findings
Outperforms existing methods by over 10% on ViDoSeek benchmark.
Effectively integrates textual and visual features for complex reasoning.
Demonstrates strong generalization and reasoning capabilities.
Abstract
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
MethodsAttention Is All You Need · Weight Decay · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Refunds@Expedia|||How do I get a full refund from Expedia? · Linear Warmup With Linear Decay
