VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Zelai Xu; Zhexuan Xu; Xiangmin Yi; Huining Yuan; Mo Guang; Kaiwen Long; Xinlei Chen; Yi Wu; Chao Yu; Yu Wang

arXiv:2506.02387 · cs.AI · April 14, 2026

TL;DR

VS-Bench is a multimodal benchmark that evaluates the strategic abilities of vision language models (VLMs) in multi-agent environments, filling a gap left by existing benchmarks, which are limited to single-agent or text-only settings.

Contribution

The paper introduces VS-Bench, a benchmark of ten vision-grounded multi-agent environments with metrics for assessing VLMs' perception, strategic reasoning, and decision-making.

Findings

01. Current VLMs excel at perception but lag in reasoning and decision-making.

02. The best model achieves 46.6% prediction accuracy and 31.4% normalized return.

03. Analysis reveals key factors affecting VLMs' strategic performance.

Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by…

Code & Models

Datasets

zelaix/VS-Bench
dataset · 107 downloads
