VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models

Mingjie Xu; Jinpeng Chen; Yuzhi Zhao; Jason Chun Lok Li; Yue Qiu; Zekang Du; Mengyang Wu; Pingping Zhang; Kun Li; Hongzheng Yang; Wenao Ma; Jiaheng Wei; Qinbin Li; Kangcheng Liu; Wenqiang Lei

arXiv:2511.11438·cs.CV·November 17, 2025

TL;DR

VP-Bench introduces a comprehensive benchmark to evaluate multimodal large language models' ability to perceive and utilize visual prompts, addressing a key gap in understanding their performance in grounded vision-language tasks.

Contribution

The paper presents VP-Bench, a two-stage evaluation framework for assessing how well MLLMs perceive and apply visual prompts (VPs), together with a large-scale dataset and an analysis of the factors that influence VP understanding.

Findings

01. 28 MLLMs evaluated, including GPT-4o and open-source models.

02. VP understanding varies significantly with prompt attributes and model scale.

03. VP-Bench sets a new standard for grounded referring question comprehension.

Abstract

Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems