VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

Borong Zhang; Jiahao Li; Jiachen Shen; Yishuai Cai; Yuhao Zhang; Yuanpei Chen; Juntao Dai; Jiaming Ji; Yaodong Yang

arXiv:2512.22539 · cs.RO · December 30, 2025

TL;DR

VLA-Arena is a comprehensive, open-source benchmark framework designed to evaluate vision-language-action models across diverse, fine-grained tasks, revealing key limitations and fostering future research in generalist robot policies.

Contribution

We introduce VLA-Arena, a structured, multi-dimensional benchmark with a novel task design framework and extensive evaluation tools for assessing VLAs' capabilities and robustness.

Findings

1. VLAs tend to memorize rather than generalize.
2. Robustness is asymmetric across perturbations.
3. Current models struggle with safety and long-horizon tasks.

Abstract

While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4)…
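The three difficulty axes named in the abstract can be pictured as a grid of evaluation conditions. The following is a minimal sketch of that idea, not the benchmark's actual API: the class name `EvalCondition`, the helper `enumerate_conditions`, and the choice to take the full cross product of levels are all assumptions made for illustration. Only the level ranges (L0-L2, W0-W4, V0-V4), the four dimension names, and the fact that fine-tuning uses L0 come from the abstract.

```python
from dataclasses import dataclass
from itertools import product

# Taken from the abstract: four task dimensions and the level ranges per axis.
DIMENSIONS = ("Safety", "Distractor", "Extrapolation", "Long Horizon")
TASK_LEVELS = tuple(f"L{i}" for i in range(3))      # L0-L2
LANGUAGE_LEVELS = tuple(f"W{i}" for i in range(5))  # W0-W4
VISUAL_LEVELS = tuple(f"V{i}" for i in range(5))    # V0-V4


@dataclass(frozen=True)
class EvalCondition:
    """Hypothetical container for one point on the three difficulty axes."""
    task_level: str
    language_level: str
    visual_level: str

    @property
    def is_fine_tuning_level(self) -> bool:
        # The abstract states that fine-tuning is performed exclusively on L0.
        return self.task_level == "L0"


def enumerate_conditions() -> list[EvalCondition]:
    # Assumption: every combination of levels is a valid condition.
    # The abstract does not say whether the benchmark evaluates the full grid.
    return [
        EvalCondition(t, w, v)
        for t, w, v in product(TASK_LEVELS, LANGUAGE_LEVELS, VISUAL_LEVELS)
    ]


conditions = enumerate_conditions()
held_out = [c for c in conditions if not c.is_fine_tuning_level]
print(len(conditions), len(held_out))
```

Under these assumptions the grid has 3 × 5 × 5 = 75 conditions per task, of which 50 sit at task levels (L1, L2) that were not used for fine-tuning.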

Code & Models

Datasets

qirui095/Flat-Stove-Turn-On (dataset, 226 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning