Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang

TL;DR
This paper introduces AGNOSTOS, a new simulation benchmark for evaluating cross-task zero-shot generalization in robotic manipulation, and proposes X-ICM, a method that improves generalization to unseen tasks by prompting a language model with in-context demonstrations.
Contribution
The paper presents AGNOSTOS, a comprehensive benchmark for cross-task generalization, and introduces X-ICM, a novel method leveraging in-context learning and dynamics-guided selection to enhance zero-shot transfer.
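To make the idea of in-context learning with demonstration selection concrete, here is a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the feature vectors, the prompt layout, and the names `select_demos` and `build_prompt` are all hypothetical stand-ins, since this summary does not describe how X-ICM's dynamics-guided selection actually scores demonstrations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demos(query_feat, demo_pool, k=2):
    """Pick the k seen-task demonstrations most similar to the query.

    Assumption: each demo carries a precomputed feature vector under "feat".
    How such features would be obtained is outside the scope of this sketch.
    """
    ranked = sorted(
        demo_pool,
        key=lambda d: cosine(query_feat, d["feat"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(demos, instruction):
    """Format the selected demos plus the unseen task as a text prompt."""
    lines = []
    for d in demos:
        lines.append(f"Task: {d['instruction']}")
        lines.append(f"Actions: {d['actions']}")
    # The unseen task comes last, with the action slot left for the model.
    lines.append(f"Task: {instruction}")
    lines.append("Actions:")
    return "\n".join(lines)


# Toy pool of seen-task demonstrations (all values are made up).
pool = [
    {"instruction": "close the drawer", "feat": [1.0, 0.0, 0.1],
     "actions": "[[0.3, 0.1, 0.2]]"},
    {"instruction": "stack the blocks", "feat": [0.0, 1.0, 0.0],
     "actions": "[[0.1, 0.4, 0.5]]"},
    {"instruction": "open the drawer", "feat": [0.9, 0.1, 0.0],
     "actions": "[[0.3, 0.2, 0.2]]"},
]

chosen = select_demos([1.0, 0.05, 0.0], pool, k=2)
prompt = build_prompt(chosen, "close the microwave")
print(prompt)
```

The resulting prompt would be passed to a language model, which is asked to complete the final `Actions:` line for the unseen task. Ranking by similarity rather than sampling demonstrations at random is the design choice being illustrated: the more relevant the in-context examples, the more useful they are likely to be for a task the model has never been trained on.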
Findings
Current VLA models struggle with unseen tasks.
X-ICM significantly improves zero-shot generalization performance.
AGNOSTOS provides a rigorous platform for evaluating generalization in manipulation.
Abstract
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that…
Taxonomy
Topics: Speech and dialogue systems · Visual and Cognitive Learning Processes
