Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny

TL;DR
This paper introduces large-scale visual document benchmarks and a novel vision-centric retrieval-augmented generation framework, V-RAG, significantly improving large multimodal models' ability to perform complex reasoning over thousands of images.
Contribution
The paper presents DocHaystack and InfoHaystack benchmarks for large-scale visual document understanding and introduces V-RAG, a new retrieval-augmented generation framework optimized for vision-language reasoning.
Findings
V-RAG improves Recall@1 by 9-11% on the DocHaystack and InfoHaystack benchmarks.
Both benchmarks evaluate large-scale visual document retrieval and understanding.
V-RAG enables efficient reasoning over thousands of images.
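For readers unfamiliar with the Recall@1 metric cited in the findings, the following is a minimal sketch of how Recall@k is commonly computed for retrieval. It assumes each question has exactly one gold document, which this summary does not state; the function name and argument names are illustrative and are not taken from the paper's code.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    gold_ids: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of questions whose gold document is among the top-k results.

    ranked_ids[i] is the retriever's ranking (best first) for question i.
    gold_ids[i] is the single correct document for question i (an assumption).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    # Count a hit when the gold document appears in the first k positions.
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)
```

With k=1 the metric only credits a question when the top-ranked document is the correct one, which is why it is a strict measure when the candidate pool holds a thousand or more documents.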
Abstract
Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with at most 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world use. To close these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module.…
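The abstract describes two ingredients, a suite of vision encoders and a question-document relevance module, without detailing how they are combined. The sketch below shows one plausible arrangement, not the authors' implementation: per-encoder similarity scores are averaged (an assumption), the top candidates are kept, and a relevance check filters them. The names `ensemble_retrieve`, `encoder_scores`, and `is_relevant` are hypothetical.

```python
from typing import Callable, Dict, List


def ensemble_retrieve(
    encoder_scores: List[Dict[str, float]],
    top_k: int,
    is_relevant: Callable[[str], bool],
) -> List[str]:
    """Rank documents by mean similarity across encoders, then filter.

    encoder_scores holds one dict per encoder, mapping document id to the
    question-document similarity that encoder produced. Averaging is an
    assumed fusion rule; the abstract does not specify one.
    is_relevant stands in for the relevance module (hypothetical callable).
    """
    if not encoder_scores:
        return []
    doc_ids = list(encoder_scores[0].keys())
    # Mean similarity per document across all encoders.
    mean_score = {
        doc: sum(scores[doc] for scores in encoder_scores) / len(encoder_scores)
        for doc in doc_ids
    }
    # Keep the top_k highest-scoring candidates, best first.
    candidates = sorted(doc_ids, key=lambda d: mean_score[d], reverse=True)[:top_k]
    # Drop candidates the relevance check rejects.
    return [doc for doc in candidates if is_relevant(doc)]
```

In this arrangement the cheap similarity ranking shrinks a pool of a thousand or more documents to a short list, so the more expensive relevance check runs on only `top_k` items.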
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
