Visually Prompted Benchmarks Are Surprisingly Fragile

Haiwen Feng; Long Lian; Lisa Dunlap; Jiahao Shu; XuDong Wang; Renhao Wang; Trevor Darrell; Alane Suhr; Angjoo Kanazawa

arXiv:2512.17875 · cs.CV · January 14, 2026

TL;DR

This paper reveals that visually prompted benchmarks for vision-language models are highly sensitive to minor visual details, affecting model rankings and evaluation reliability, and introduces VPBench to improve robustness.

Contribution

The authors identify fragility in current visually prompted benchmarks and create VPBench, a larger, more stable benchmark with multiple visual marker variants to enhance evaluation consistency.

Findings

01. Model rankings are highly sensitive to visual marker color and size.

02. Benchmark setup details significantly influence model performance.

03. VPBench provides a more stable and comprehensive evaluation framework.

Abstract

A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and…
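
The ranking instability described in the abstract can be made concrete with a small sketch. The code below compares two leaderboards, one per marker color, using Kendall's rank correlation. Everything in it is an assumption for illustration: the model names, the accuracy values, and the choice of Kendall's tau are placeholders and are not taken from the paper, which may measure stability differently.

```python
from itertools import combinations


def ranking(scores):
    """Return model names ordered from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same models.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Tied pairs count as neither concordant nor discordant (no tie correction).
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both variants order this pair the same way.
        agreement = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical accuracies per marker variant (placeholders, NOT results from the paper).
red_marker = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.64}
blue_marker = {"model_a": 0.66, "model_b": 0.69, "model_c": 0.67}

if __name__ == "__main__":
    print("red ranking: ", ranking(red_marker))
    print("blue ranking:", ranking(blue_marker))
    print("Kendall tau: ", kendall_tau(red_marker, blue_marker))
```

A tau near 1 would suggest that the leaderboard is stable under the marker change, while values near 0 or below indicate that the ordering of models depends heavily on the marker design.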

Code & Models

Datasets

longlian/VPBench
dataset · 513 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling