Benchmarking the Generality of Vision-Language-Action Models
Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang

TL;DR
This paper introduces MultiNet v1.0, a unified benchmark for evaluating the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs). It finds that current models generalize poorly beyond their training distributions.
Contribution
The paper presents MultiNet v1.0, a unified benchmark for assessing the generality of vision-language and vision-language-action models across six capability regimes, addressing fragmented evaluation practices.
Findings
Models show significant performance drops on unseen domains and modalities.
Current models suffer from modality misalignment and output instability.
There is a notable gap between the goal of generalist AI and current model capabilities.
Abstract
Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT-5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language and cultural evolution
