Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
Aditya Sharma, Michael Saxon, William Yang Wang

TL;DR
This paper introduces LoCoVQA, a benchmark for testing vision language models' ability to perform extractive reasoning in long visual contexts, revealing their rapid performance decline with increasing distractors.
Contribution
The paper presents LoCoVQA, a novel dynamic benchmark generator that evaluates VLMs' long-context reasoning and their ability to ignore distractors across mathematical reasoning, VQA, and character recognition tasks.
Findings
VLM performance drops rapidly as the visual context grows, often following a logarithmic decay trend
Current VLMs struggle to ignore irrelevant distractors in long visual contexts
State-of-the-art VLMs lack essential long-context reasoning skills
Abstract
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
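The setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' generator: the function names `build_context` and `fit_log_decay` are hypothetical, plain file-name strings stand in for images, and the decay model accuracy ≈ a + b·ln(n) is an assumed functional form for the "logarithmic decay trend". The first function hides one target image among randomly drawn distractors; the second fits the decay curve to measured accuracies by ordinary least squares.

```python
import math
import random


def build_context(target, distractors, context_length, rng):
    """Hide `target` among (context_length - 1) randomly chosen distractors.

    Returns (items, target_index). Items are placeholders for images;
    a real harness would load and arrange actual image data.
    """
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    if context_length - 1 > len(distractors):
        raise ValueError("not enough distractors for this context length")
    # Sample distractors without replacement, then insert the target
    # at a uniformly random position in the sequence.
    items = rng.sample(distractors, context_length - 1)
    target_index = rng.randrange(context_length)
    items.insert(target_index, target)
    return items, target_index


def fit_log_decay(lengths, accuracies):
    """Least-squares fit of accuracy ~ a + b * ln(length); returns (a, b).

    A negative b means accuracy falls as the context gets longer.
    Requires at least two distinct context lengths.
    """
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

In use, one would call `build_context` for each test example at several context lengths, score the model's answers, and pass the per-length accuracies to `fit_log_decay` to check how well a logarithmic curve describes the drop.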
Taxonomy
Topics: Categorization, perception, and language · Language, Metaphor, and Cognition
Methods: Sparse Evolutionary Training · Exponential Decay
