Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

Pranav Guruprasad; Yangyue Wang; Sudipta Chowdhury; Harshvardhan Sikka; Paul Pu Liang

arXiv:2505.05540·cs.CV·June 18, 2025

TL;DR

This paper introduces MultiNet v0.2, a benchmark for evaluating vision-language-action models' zero-shot generalization in procedurally generated, out-of-distribution environments, revealing key limitations and strengths of current models.

Contribution

The paper presents a new comprehensive benchmark, MultiNet v0.2, for assessing the generalization of VLA models in OOD environments, along with an analysis of their performance and factors affecting it.

Findings

01 All models show limited zero-shot OOD generalization.

02 VLAs outperform other models due to architecture.

03 Prompt engineering significantly affects model performance.

Abstract

Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in procedurally generated, out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLMs and VLAs, including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST, on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2)…

Code & Models

Repositories

ManifoldRG/MultiNet (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Softmax · Absolute Position Encodings · Residual Connection