TL;DR
vla-eval is an open-source framework that simplifies and accelerates the comprehensive evaluation of vision-language-action models across multiple benchmarks.
Contribution
It introduces a unified, Docker-based evaluation harness that decouples model inference from benchmark execution, enabling scalable and reproducible assessments.
Findings
Supports 14 simulation benchmarks and six model servers.
Achieves up to a 47x speedup in evaluation time.
Reproduces scores across multiple codebases and benchmarks.
Abstract
Vision-Language-Action (VLA) models are increasingly evaluated across multiple simulation benchmarks, yet adding each benchmark to an evaluation pipeline requires resolving incompatible dependencies, matching underspecified evaluation protocols, and reverse-engineering undocumented preprocessing. This burden scales with the number of models and benchmarks, making comprehensive evaluation impractical for most teams. We present vla-eval, an open-source evaluation harness that eliminates this per-benchmark cost by decoupling model inference from benchmark execution through a WebSocket+msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via a four-method interface; the full cross-evaluation matrix works automatically. The framework supports 14 simulation benchmarks and six model servers.…
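The decoupling described in the abstract can be illustrated with a minimal in-process sketch. Only the `predict()` method name comes from the abstract; the four benchmark method names (`reset`, `step`, `is_done`, `get_result`), the toy classes, and the helper functions are hypothetical stand-ins, not vla-eval's actual API. The sketch also calls the model directly, whereas the framework is described as exchanging messages over WebSocket+msgpack between Docker-isolated processes.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ModelServer(ABC):
    """Model side: the abstract names a single predict() method."""

    @abstractmethod
    def predict(self, observation: Dict[str, Any]) -> List[float]:
        ...


class Benchmark(ABC):
    """Benchmark side: four methods. These names are hypothetical."""

    @abstractmethod
    def reset(self) -> Dict[str, Any]:
        ...

    @abstractmethod
    def step(self, action: List[float]) -> Dict[str, Any]:
        ...

    @abstractmethod
    def is_done(self) -> bool:
        ...

    @abstractmethod
    def get_result(self) -> Dict[str, Any]:
        ...


class ConstantModel(ModelServer):
    """Toy model that ignores the observation and always emits [1.0]."""

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        return [1.0]


class CountingBenchmark(Benchmark):
    """Toy task: succeeds once the summed actions reach a target."""

    def __init__(self, target: float = 3.0, max_steps: int = 10) -> None:
        self.target = target
        self.max_steps = max_steps
        self.total = 0.0
        self.steps = 0

    def reset(self) -> Dict[str, Any]:
        self.total = 0.0
        self.steps = 0
        return {"total": self.total}

    def step(self, action: List[float]) -> Dict[str, Any]:
        self.total += action[0]
        self.steps += 1
        return {"total": self.total}

    def is_done(self) -> bool:
        return self.total >= self.target or self.steps >= self.max_steps

    def get_result(self) -> Dict[str, Any]:
        return {"success": self.total >= self.target, "steps": self.steps}


def run_episode(model: ModelServer, benchmark: Benchmark) -> Dict[str, Any]:
    """Drive one episode using only the two interfaces."""
    observation = benchmark.reset()
    while not benchmark.is_done():
        action = model.predict(observation)
        observation = benchmark.step(action)
    return benchmark.get_result()


def cross_evaluate(
    models: Dict[str, ModelServer], benchmarks: Dict[str, Benchmark]
) -> Dict[tuple, Dict[str, Any]]:
    """Run every model against every benchmark (the cross-evaluation matrix)."""
    return {
        (model_name, bench_name): run_episode(model, bench)
        for model_name, model in models.items()
        for bench_name, bench in benchmarks.items()
    }


if __name__ == "__main__":
    matrix = cross_evaluate(
        {"constant": ConstantModel()},
        {"count3": CountingBenchmark(3.0), "count5": CountingBenchmark(5.0)},
    )
    for pair, result in matrix.items():
        print(pair, result)
```

Because `run_episode` depends only on the two interfaces, a newly added model or benchmark pairs with every existing counterpart without extra glue code, which is the property the abstract attributes to the one-time integration.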
