Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Jiaming Zhou; Ke Ye; Jiayi Liu; Teli Ma; Zifan Wang; Ronghe Qiu; Kun-Yu Lin; Zhilin Zhao; Junwei Liang

arXiv:2505.15660·cs.RO·October 21, 2025

TL;DR

This paper introduces AGNOSTOS, a new benchmark for evaluating cross-task zero-shot generalization in robotic manipulation, and proposes X-ICM, a method that improves generalization by conditioning language models on demonstrations.

Contribution

The paper presents AGNOSTOS, a comprehensive benchmark for cross-task generalization, and introduces X-ICM, a novel method leveraging in-context learning and dynamics-guided selection to enhance zero-shot transfer.
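The idea of selecting demonstrations and conditioning a language model on them can be illustrated with a minimal sketch. This summary does not describe how the paper's dynamics-guided selection works, so the code below substitutes plain cosine similarity over feature vectors as a stand-in. The function names (`select_demos`, `build_prompt`), the feature vectors, and the action strings are all hypothetical and are not taken from the X-ICM repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_demos(query_feat, demo_pool, k=2):
    """Return the k seen-task demonstrations most similar to the query features.

    Assumption: similarity is cosine similarity; the paper's actual
    dynamics-guided criterion is not specified in this summary.
    """
    ranked = sorted(demo_pool,
                    key=lambda d: cosine(query_feat, d["feat"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(selected, instruction):
    """Concatenate the selected demonstrations, then the unseen-task instruction."""
    blocks = [f"Task: {d['instruction']}\nActions: {d['actions']}" for d in selected]
    blocks.append(f"Task: {instruction}\nActions:")
    return "\n\n".join(blocks)

# Hypothetical seen-task demonstrations with made-up features and actions.
demo_pool = [
    {"instruction": "open drawer", "feat": [1.0, 0.0, 0.0], "actions": "reach; grasp; pull"},
    {"instruction": "close jar", "feat": [0.0, 1.0, 0.0], "actions": "reach; grasp; twist"},
    {"instruction": "slide block", "feat": [0.7, 0.7, 0.0], "actions": "reach; push"},
]

selected = select_demos([0.9, 0.1, 0.0], demo_pool, k=2)
prompt = build_prompt(selected, "open cabinet")
```

The resulting `prompt` string would then be passed to a language model, which is expected to complete the final `Actions:` line for the unseen task. Swapping `cosine` for a different scoring function changes the selection criterion without touching the prompt construction.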

Findings

01. Current VLA models struggle with unseen tasks.

02. X-ICM significantly improves zero-shot generalization performance.

03. AGNOSTOS provides a rigorous platform for evaluating generalization in manipulation.

Abstract

The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that…

Code & Models

Repositories

jiaming-zhou/X-ICM (official)

Taxonomy

Topics: Speech and dialogue systems · Visual and Cognitive Learning Processes