Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

Aditya Sharma; Michael Saxon; William Yang Wang

arXiv:2406.16851 · cs.CL · October 7, 2024

TL;DR

This paper introduces LoCoVQA, a benchmark that tests how well vision language models (VLMs) perform extractive reasoning over long visual contexts, and shows that their performance declines rapidly as the number of distractor images grows.

Contribution

The paper presents LoCoVQA, a dynamic benchmark generator that evaluates VLMs' long-context extractive reasoning and their ability to ignore distractor images across mathematical reasoning, VQA, and character recognition tasks.

Findings

01. VLM performance drops rapidly as the visual context grows, often following a logarithmic decay trend

02. Current VLMs struggle to ignore irrelevant distractor images in long visual contexts

03. State-of-the-art VLMs lack the distractor-ignoring capability that many long-context applications require
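A logarithmic decay trend of the kind described in the findings can be checked by fitting accuracy against the natural log of the context length with ordinary least squares. The following is a minimal sketch under that assumption; the function name `fit_log_decay` and the model form `accuracy = a + b * ln(n)` are illustrative choices, not the authors' analysis code, and the snippet contains no values from the paper.

```python
import math


def fit_log_decay(context_lengths, accuracies):
    """Least-squares fit of accuracy = a + b * ln(n).

    Hypothetical helper for illustration. Returns (a, b); a negative b
    means accuracy falls as the number of images n grows.
    """
    if len(context_lengths) != len(accuracies):
        raise ValueError("inputs must have the same length")
    if len(set(context_lengths)) < 2:
        raise ValueError("need at least two distinct context lengths")

    # Work in log space so the fit reduces to simple linear regression.
    xs = [math.log(n) for n in context_lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

With per-length accuracies from any evaluation run, calling `fit_log_decay(lengths, accuracies)` gives the intercept (accuracy at a single image, since ln(1) = 0) and the slope per unit of ln(n).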

Abstract

We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
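The augmentation described in the abstract, hiding one relevant image among a growing number of distractors, can be sketched as follows. This is an illustrative sketch, not the LoCoVQA implementation: the function `build_visual_context`, its parameters, and the use of string identifiers in place of real images are all assumptions made to keep the example self-contained.

```python
import random


def build_visual_context(needle, distractor_pool, context_length, seed=0):
    """Place `needle` at a random position among sampled distractors.

    Hypothetical helper for illustration. Items are opaque identifiers
    (for example, image file names). Returns (context, needle_index).
    """
    if context_length < 1:
        raise ValueError("context_length must be at least 1")

    # Never sample the needle itself as a distractor.
    candidates = [d for d in distractor_pool if d != needle]
    if context_length - 1 > len(candidates):
        raise ValueError("not enough distractors for the requested length")

    rng = random.Random(seed)
    # Draw distractors without replacement, then pick the needle's slot.
    distractors = rng.sample(candidates, context_length - 1)
    needle_index = rng.randrange(context_length)
    context = distractors[:needle_index] + [needle] + distractors[needle_index:]
    return context, needle_index


# Example: the same query image embedded in contexts of increasing length.
pool = [f"distractor_{i}.png" for i in range(40)]
for length in (1, 4, 9, 16):
    context, position = build_visual_context("query.png", pool, length, seed=length)
```

Evaluating a model on the same question at each context length, with only the number of distractors changing, isolates how much the extra images alone hurt accuracy.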

Taxonomy

Topics: Categorization, perception, and language · Language, Metaphor, and Cognition

Methods: Sparse Evolutionary Training · Exponential Decay