Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen; Dannong Xu; Junjie Fei; Chun-Mei Feng; Mohamed Elhoseiny

arXiv:2411.16740 · cs.CV · December 9, 2024

TL;DR

This paper introduces large-scale visual document benchmarks and a novel vision-centric retrieval-augmented generation framework, V-RAG, significantly improving large multimodal models' ability to perform complex reasoning over thousands of images.

Contribution

The paper presents the DocHaystack and InfoHaystack benchmarks for large-scale visual document understanding and introduces V-RAG, a retrieval-augmented generation framework optimized for vision-language reasoning.

Findings

01 V-RAG improves Recall@1 by 9-11% on the benchmarks.

02 The DocHaystack and InfoHaystack benchmarks evaluate large-scale visual document retrieval and understanding.

03 V-RAG enables efficient reasoning over thousands of images.
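For readers unfamiliar with the Recall@1 metric cited in the findings, here is a minimal sketch of how Recall@k is commonly computed for retrieval. It assumes each question has exactly one gold document. The function name and input layout are illustrative and are not taken from the paper's code.

```python
def recall_at_k(ranked_ids, gold_ids, k=1):
    """Fraction of questions whose gold document is among the top-k retrieved.

    ranked_ids: one ranked list of document ids per question (best first).
    gold_ids:   one gold document id per question (assumption: a single
                correct document per question).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("at least one question is required")
    # A question counts as a hit when its gold id is in the first k results.
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


if __name__ == "__main__":
    rankings = [["doc_a", "doc_b"], ["doc_c", "doc_d"]]
    gold = ["doc_a", "doc_d"]
    print(recall_at_k(rankings, gold, k=1))
    print(recall_at_k(rankings, gold, k=2))
```

Recall@1 is the strictest setting: a question counts as correct only when the top-ranked document is the gold one.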

Abstract

Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications that require complex reasoning over a large number of images. Existing benchmarks for multi-image question answering are limited in scope: each question is paired with at most 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world use. To close these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module.…
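The retrieval idea in the abstract, several encoders scoring each document followed by a relevance check, can be illustrated with a rough sketch. The abstract as shown here does not say how encoder scores are combined or how the relevance module decides, so the averaging of cosine similarities and the boolean `relevance_fn` callback below are assumptions made for illustration. All names are hypothetical and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_retrieve(question_embs, doc_embs, relevance_fn, top_k=2):
    """Rank documents by averaged per-encoder similarity, then filter.

    question_embs: list of question vectors, one per encoder.
    doc_embs:      dict mapping doc id -> list of vectors, one per encoder,
                   in the same encoder order as question_embs.
    relevance_fn:  hypothetical stand-in for a question-document relevance
                   module; returns True when a document should be kept.
    """
    scores = {}
    for doc_id, per_encoder in doc_embs.items():
        sims = [cosine(q, d) for q, d in zip(question_embs, per_encoder)]
        # Assumption: encoders are combined by a plain average of similarities.
        scores[doc_id] = sum(sims) / len(sims)
    # Keep the top_k candidates, best first, then apply the relevance filter.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [doc_id for doc_id in ranked if relevance_fn(doc_id)]


if __name__ == "__main__":
    question = [[1.0, 0.0], [0.0, 1.0]]  # two toy "encoders"
    docs = {
        "a": [[1.0, 0.0], [0.0, 1.0]],
        "b": [[0.0, 1.0], [0.0, 1.0]],
        "c": [[0.0, 1.0], [1.0, 0.0]],
    }
    print(ensemble_retrieve(question, docs, lambda d: d != "b", top_k=2))
```

Splitting the work into a cheap similarity stage and a more selective filter keeps the number of documents passed to the downstream model small, which is the general motivation for retrieval-augmented pipelines over large document piles.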

Code & Models

Repositories

vision-cair/dochaystacks (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques