Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

TL;DR
This paper introduces Visual Haystacks, a new long-context, vision-centric benchmark for multi-image question answering, and MIRAGE, a lightweight retrieval-augmented generation framework that significantly improves multi-image reasoning performance.
Contribution
The paper presents a novel benchmark for multi-image reasoning and a scalable retrieval-augmented framework that enhances performance on large-scale visual question answering tasks.
Findings
Current LMMs struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, and exhibit biases.
MIRAGE improves performance by up to 13% on the Visual Haystacks benchmark.
MIRAGE can process up to 10,000 images on a single GPU.
Abstract
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as…
Peer Reviews
Decision: ICLR 2025 Poster
* Novel Multi-Image QA Benchmark: The authors introduce an interesting multi-image QA benchmark, Visual Haystacks, designed around a vision-centric "needle-in-a-haystack" scenario, providing a fresh and challenging setting for LMM evaluation.
* Comprehensive Model Evaluation: The paper conducts a thorough evaluation of LMMs on the VHs benchmark, uncovering important insights into current models, such as vulnerability to visual distractors, challenges with multi-image understanding, and ten
* Limited Object Diversity: The authors constructed the VHs benchmark using objects from the COCO dataset, which contains only 80 object categories. This limited selection may restrict the diversity and comprehensiveness of the benchmark, potentially affecting its ability to evaluate models across a broader range of visual scenarios.
* Restricted Question Diversity: The authors appear to rely on a few simple templates to generate questions, which may restrict the variety of question types in th
- I generally feel that the direction, designing a meaningful Visual Haystack benchmark for evaluating VLMs, is important to our community.
- Some interesting points are discovered when evaluating models on the proposed benchmark. Since random guessing achieves 50% accuracy on the proposed benchmark, the performance of some open-source VLMs drops significantly even when the haystack size is very small. However, those models maintain high scores on some public evaluation datasets.
- Some detailed experiments ar
- Benchmark construction is still mainly centered on recognition tasks, based on the benchmark design principles listed in Lines 129-138. Basically, it requires strong recognition across all the input images, rather than true visual reasoning.
- Based on Figures 2 and 3, certain models, such as Gemini, GPT and the proposed MIRAGE, consistently perform better on the proposed multi-needle challenges compared to single-needle tasks. However, the multi-needle challenges are intentionally designed
1. This paper introduces a new visual needle-in-a-haystack benchmark composed of 1k yes/no questions.
2. It evaluates both open-source and closed-source models and reports three insightful findings.
3. It introduces a new baseline called MIRAGE for better handling of visual haystack tasks.
1. The questions are limited to yes/no questions.
2. The question templates are very limited; there seem to be only three.
3. MIRAGE has a significant performance drop in 4 out of 7 general VQA tasks.
4. The approach of MIRAGE, deselecting unrelated (distracting) images, somehow circumvents the VH challenge, as the challenge lies in how a model can reason over a long context.
5. The task of finding a target object still does not seem to simulate a real world scenario of long context visual reasoning t
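Two points raised in the reviews can be made concrete with a toy sketch: dropping distractor images before answering, and the 50% chance level on a balanced yes/no set. This is not the authors' implementation of MIRAGE or of the VHs evaluation. The function names, the relevance scores, and the 0.5 threshold are all hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def filter_haystack(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return indices of images whose (hypothetical) relevance score meets the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def yes_no_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of yes/no answers that match the ground truth."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Toy haystack of five images; the scores are made up for illustration.
relevance = [0.05, 0.91, 0.12, 0.48, 0.77]
kept = filter_haystack(relevance, threshold=0.5)

# On a balanced yes/no set, always answering "yes" scores exactly at chance.
labels = ["yes", "no", "yes", "no"]
always_yes = ["yes"] * len(labels)
chance = yes_no_accuracy(always_yes, labels)
```

Here `kept` holds the indices of the images that would be passed on to the question-answering step, and `chance` is the accuracy of a constant "yes" answer on the balanced toy labels. Reported accuracies need to be read against that baseline.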
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
