vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models

Suhwan Choi; Yunsung Lee; Yubeen Park; Chris Dongjoo Kim; Ranjay Krishna; Dieter Fox; Youngjae Yu

arXiv:2603.13966 · cs.AI · April 20, 2026

TL;DR

vla-eval is an open-source framework that simplifies and accelerates the comprehensive evaluation of vision-language-action models across multiple benchmarks.

Contribution

It introduces a unified, Docker-based evaluation harness that decouples model inference from benchmark execution, enabling scalable and reproducible assessments.

Findings

01 Supports 14 benchmarks and 6 model servers.

02 Achieves up to 47x speedup in evaluation time.

03 Reproduces scores across multiple codebases and benchmarks.

Abstract

Vision-Language-Action (VLA) models are increasingly evaluated across multiple simulation benchmarks, yet adding each benchmark to an evaluation pipeline requires resolving incompatible dependencies, matching underspecified evaluation protocols, and reverse-engineering undocumented preprocessing. This burden scales with the number of models and benchmarks, making comprehensive evaluation impractical for most teams. We present vla-eval, an open-source evaluation harness that eliminates this per-benchmark cost by decoupling model inference from benchmark execution through a WebSocket+msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via a four-method interface; the full cross-evaluation matrix works automatically. The framework supports 14 simulation benchmarks and six model servers.…
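The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the vla-eval API: only the `predict()` method name comes from the abstract. The four benchmark method names (`reset`, `step`, `is_done`, `get_result`), the message layout, and the toy model and benchmark are all hypothetical. To stay within the standard library, the sketch replaces the WebSocket transport with an in-process call and uses JSON as a stand-in for msgpack.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Tuple


class ModelServer(ABC):
    """Model side: one predict() method, as the abstract describes."""

    @abstractmethod
    def predict(self, observation: Dict[str, Any]) -> List[float]: ...


class Benchmark(ABC):
    """Benchmark side: four methods. These names are hypothetical."""

    @abstractmethod
    def reset(self) -> Dict[str, Any]: ...

    @abstractmethod
    def step(self, action: List[float]) -> Dict[str, Any]: ...

    @abstractmethod
    def is_done(self) -> bool: ...

    @abstractmethod
    def get_result(self) -> Dict[str, Any]: ...


def encode(message: Dict[str, Any]) -> bytes:
    # JSON stands in for msgpack here (assumption, stdlib only).
    return json.dumps(message).encode("utf-8")


def decode(payload: bytes) -> Dict[str, Any]:
    return json.loads(payload.decode("utf-8"))


def run_episode(model: ModelServer, benchmark: Benchmark, max_steps: int = 100) -> Dict[str, Any]:
    """Drive one episode; every exchange crosses a serialized boundary."""
    observation = benchmark.reset()
    steps = 0
    while not benchmark.is_done() and steps < max_steps:
        request = encode({"observation": observation})
        action = model.predict(decode(request)["observation"])
        response = encode({"action": action})
        observation = benchmark.step(decode(response)["action"])
        steps += 1
    return benchmark.get_result()


class ConstantModel(ModelServer):
    """Toy model: always moves one unit to the right."""

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        return [1.0]


class ReachTargetBenchmark(Benchmark):
    """Toy 1-D task: succeed once the position reaches the target."""

    def __init__(self, target: float = 3.0) -> None:
        self.target = target
        self.position = 0.0

    def reset(self) -> Dict[str, Any]:
        self.position = 0.0
        return {"position": self.position, "instruction": "move right"}

    def step(self, action: List[float]) -> Dict[str, Any]:
        self.position += action[0]
        return {"position": self.position, "instruction": "move right"}

    def is_done(self) -> bool:
        return self.position >= self.target

    def get_result(self) -> Dict[str, Any]:
        return {"success": self.position >= self.target}


def evaluate_matrix(
    models: Dict[str, ModelServer],
    benchmark_factories: Dict[str, Callable[[], Benchmark]],
) -> Dict[Tuple[str, str], Dict[str, Any]]:
    """Run every model against every benchmark with no pair-specific code."""
    return {
        (model_name, bench_name): run_episode(model, make_benchmark())
        for model_name, model in models.items()
        for bench_name, make_benchmark in benchmark_factories.items()
    }


if __name__ == "__main__":
    print(evaluate_matrix({"constant": ConstantModel()}, {"reach": ReachTargetBenchmark}))
```

Because each side depends only on its own interface and on the serialized messages, adding a model or a benchmark requires no changes to the other side. In this sketch that property is what makes `evaluate_matrix` a plain nested loop.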

Code & Models

Repositories

allenai/vla-evaluation-harness (GitHub)
