Enhancing Document VQA Models via Retrieval-Augmented Generation
Eric López, Artemis Llabrés, Ernest Valveny

TL;DR
This paper explores retrieval-augmented generation techniques for Document VQA, demonstrating that selective evidence retrieval significantly improves accuracy over traditional concatenation methods, with text-based retrieval yielding the highest gains.
Contribution
The paper systematically evaluates retrieval-augmented approaches to Document VQA, comparing text-based and visual retrieval across multiple datasets and models, and calls into question the value of layout-guided chunking strategies.
Findings
Text-based retrieval improves ANLS by up to 22.5 points over the "concatenate-all-pages" baseline.
Visual retrieval achieves a +5.0 ANLS improvement without OCR.
Retrieval and reranking are key to performance gains.
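The findings above are reported in ANLS (Average Normalized Levenshtein Similarity), the standard Document VQA metric. The following is a minimal sketch of how ANLS is commonly computed, assuming the usual threshold of 0.5 and case-insensitive matching; the function names are illustrative and this is not the authors' evaluation code. It returns a score in [0, 1], whereas results such as "+22.5 ANLS" are typically quoted on a 0-100 scale.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best similarity against any reference answer.

    A similarity is zeroed when the normalized edit distance reaches
    the threshold (0.5 is the commonly used value, assumed here).
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Because near-miss answers (for example, small OCR errors) still receive partial credit, ANLS is more forgiving than exact match, while answers that differ by half or more of their characters score zero.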
Abstract
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring…
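The retrieve-then-generate idea described in the abstract can be illustrated with a small sketch of the text-based variant. It assumes the OCR text of each page (or chunk) is already available as a string, and it uses a simple TF-IDF-style lexical score as a stand-in retriever; the paper's actual retriever, reranker, and prompt format are not specified here, so `retrieve_top_k` and `build_prompt` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a stand-in for real OCR token handling."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_top_k(question: str, chunks: List[str], k: int = 2) -> List[int]:
    """Return indices of the k chunks that best match the question.

    Scoring is a simple TF-IDF overlap (an assumption, not the paper's method).
    """
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(d.keys())

    q_terms = set(tokenize(question))
    scored = []
    for idx, d in enumerate(docs):
        length = sum(d.values()) or 1
        score = sum((d[t] / length) * math.log(1 + n / doc_freq[t])
                    for t in q_terms if t in d)
        scored.append((-score, idx))  # negative score so sorting ranks best first
    scored.sort()
    return [idx for _, idx in scored[:k]]


def build_prompt(question: str, chunks: List[str], k: int = 2) -> str:
    """Assemble a prompt from only the retrieved evidence, not every page."""
    selected = retrieve_top_k(question, chunks, k)
    evidence = "\n".join(f"[page {i + 1}] {chunks[i]}" for i in selected)
    return f"Context:\n{evidence}\n\nQuestion: {question}\nAnswer:"


pages = [
    "Invoice total amount due: 420 USD",
    "Shipping address and contact details",
    "Terms and conditions of the warranty",
]
prompt = build_prompt("What is the total amount due?", pages, k=1)
```

The point of the sketch is the contrast with the "concatenate-all-pages" baseline: the generator sees only the top-ranked evidence, so context length stays bounded regardless of how many pages the document has.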
