Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark

Goeric Huybrechts; Srikanth Ronanki; Sai Muralidhar Jayanthi; Jack Fitzgerald; Srinivasan Veeravanallur

arXiv:2507.15882 · cs.CV · August 6, 2025



TL;DR

Document Haystack is a new benchmark that evaluates vision language models on long, complex documents with multimodal content, addressing a key gap in multimodal AI research.

Contribution

It introduces a comprehensive, automated benchmark with diverse long documents and embedded multimodal elements to assess VLMs' retrieval and understanding capabilities.

Findings

1. Prominent VLMs show limited performance on long documents.
2. The benchmark reveals challenges in multimodal retrieval at scale.
3. It provides a standardized platform for future research.

Abstract

The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents to challenge VLMs' retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present…
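To make the "needles at various depths" idea concrete, here is a minimal sketch in Python. It is not the authors' construction pipeline: the even-spacing placement rule, the `Needle` fields, the example needle, and the substring-based scoring are all assumptions chosen for illustration. The abstract only states that needles are inserted at various depths and that evaluation is objective and automated.

```python
from dataclasses import dataclass


@dataclass
class Needle:
    key: str    # hypothetical: what the question asks about
    value: str  # hypothetical: the answer hidden in the document
    page: int   # 1-indexed page on which the needle is placed


def needle_pages(num_pages: int, num_needles: int) -> list[int]:
    """Spread needles over the document depth (assumed placement rule).

    The document is cut into num_needles equal slices and each needle
    lands near the middle of its slice, using integer arithmetic only.
    """
    if num_pages < 1 or num_needles < 1:
        raise ValueError("num_pages and num_needles must be positive")
    return [
        ((2 * i + 1) * num_pages) // (2 * num_needles) + 1
        for i in range(num_needles)
    ]


def is_correct(response: str, needle: Needle) -> bool:
    """Assumed scoring rule: case-insensitive substring match on the value."""
    return needle.value.lower() in response.lower()


def accuracy(responses: list[str], needles: list[Needle]) -> float:
    """Fraction of needles whose value appears in the paired response."""
    if len(responses) != len(needles):
        raise ValueError("need exactly one response per needle")
    hits = sum(is_correct(r, n) for r, n in zip(responses, needles))
    return hits / len(needles)


if __name__ == "__main__":
    pages = needle_pages(num_pages=10, num_needles=5)
    needles = [Needle(key=f"item-{p}", value=f"value-{p}", page=p) for p in pages]
    # Stand-in model outputs: the first is right, the rest are wrong.
    responses = ["The answer is VALUE-2."] + ["I could not find it."] * 4
    print(pages, accuracy(responses, needles))
```

A substring match is only one way to keep scoring automatic; it works when each needle's answer is a short, distinctive string, and it would need tightening if answers could appear in the surrounding document text by chance.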


Taxonomy

Topics: Semantic Web and Ontologies