MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

Dannong Xu; Zhongyu Yang; Jun Chen; Yingfang Yuan; Ming Hu; Lei Sun; Luc Van Gool; Danda Pani Paudel; Chun-Mei Feng

arXiv:2603.05697 · cs.CV · March 9, 2026

TL;DR

MultiHaystack is a new benchmark that evaluates retrieval and reasoning together in multimodal models over a large, heterogeneous corpus of documents, images, and videos. It shows that current retrieval methods struggle at this scale.

Contribution

This paper introduces MultiHaystack, the first large-scale benchmark for assessing retrieval and reasoning in multimodal models across diverse media types.

Findings

01. Models perform well with provided evidence but poorly when retrieving from large pools.

02. E5-V achieves only 40.8% Recall@1 in retrieval.

03. Reasoning accuracy drops from 80.86% to 51.4% with top-5 retrieval.

Abstract

Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool,…
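
The Recall@1 and top-5 figures in the findings can be read through a simple definition. Because each question is grounded in a single evidence item, Recall@k reduces to the fraction of questions whose evidence item appears among the top k retrieved candidates. The following is a minimal sketch under that assumption, not the authors' evaluation code; the function name `recall_at_k`, the dictionary layout, and the toy item IDs are all hypothetical.

```python
from typing import Dict, Sequence


def recall_at_k(
    ranked: Dict[str, Sequence[str]],
    gold: Dict[str, str],
    k: int,
) -> float:
    """Fraction of questions whose single gold evidence item is in the top-k.

    ranked: question ID -> candidate IDs, best first (hypothetical layout).
    gold:   question ID -> the one gold evidence ID for that question.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not gold:
        return 0.0
    hits = sum(
        1 for qid, gold_id in gold.items()
        if gold_id in ranked.get(qid, [])[:k]
    )
    return hits / len(gold)


# Toy data (illustrative only, not drawn from the benchmark).
ranked = {
    "q1": ["img_7", "doc_2", "vid_4"],
    "q2": ["doc_9", "vid_1", "img_3"],
    "q3": ["vid_5", "img_8", "doc_6"],
    "q4": ["doc_1", "img_2", "vid_3"],
}
gold = {"q1": "img_7", "q2": "vid_1", "q3": "doc_6", "q4": "img_0"}

r1 = recall_at_k(ranked, gold, 1)  # only q1 hits at rank 1
r2 = recall_at_k(ranked, gold, 2)  # q1 and q2 hit within the top 2
r3 = recall_at_k(ranked, gold, 3)  # q1, q2, q3 hit; q4's gold item is never retrieved
```

Under this definition Recall@k can only stay the same or grow as k increases, so a low Recall@1 does not by itself bound what a reasoning model sees when it is given the top 5 candidates.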

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning