Benchmarking the Generality of Vision-Language-Action Models

Pranav Guruprasad; Sudipta Chowdhury; Harsh Sikka; Mridul Sharma; Helen Lu; Sean Rivera; Aryan Khurana; Hangliang Ren; Yangyue Wang

arXiv:2512.11315 · cs.LG · December 15, 2025

Open Access

TL;DR

This paper introduces MultiNet v1.0, a unified benchmark for evaluating the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs). It finds that current models generalize poorly beyond their training distributions.

Contribution

The paper presents MultiNet v1.0, a unified benchmark that assesses the generality of vision-language and vision-language-action models across six capability regimes, addressing evaluation practices that are currently fragmented across isolated benchmarks.

Findings

01. Models show significant performance drops on unseen domains and modalities.

02. Current models suffer from modality misalignment and output instability.

03. There is a notable gap between the goal of generalist AI and current model capabilities.

Abstract

Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT-5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language and cultural evolution