VLAs are Confined yet Capable of Generalizing to Novel Instructions
Quanyi Li

TL;DR
This paper introduces a method to recombine learned behaviors in vision-language-action models by manipulating internal representations, significantly improving extrapolation to novel tasks.
Contribution
It presents a novel inference-time manipulation of text latent representations to enable VLAs to generalize to unseen task combinations.
Findings
Interpreting the text latent allows behaviors to be recombined for novel tasks.
Interpolating the text latents boosts the success rate to 83% on a new benchmark.
Decoding the text latent produces human-unreadable prompts that still effectively instruct VLAs.
Abstract
Vision-language-action models (VLAs) often achieve high performance on demonstrated tasks but struggle significantly when required to extrapolate, combining skills learned from different tasks in novel ways. For instance, VLAs might successfully put the cream cheese in the bowl and put the bowl on top of the cabinet, yet still fail to put the cream cheese on top of the cabinet. In this work, we demonstrate that behaviors from distinct tasks can be effectively recombined by manipulating the VLA's internal representations at inference time. Concretely, we identify the text latent by averaging the text tokens' hidden states across all demonstrated trajectories for a specific base task. For executing an extrapolated task, we can temporally interpolate the text latent of the two base tasks and add it back to the text hidden states, so sub-behaviors from the two tasks will be activated…
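The two steps the abstract describes, averaging text-token hidden states into a text latent and temporally interpolating two such latents before adding the result back, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the array layout, the linear schedule, and the function names `text_latent`, `interpolate_latents`, and `inject` are all assumptions made for the example, and the abstract does not specify which layers are modified or how the interpolation weights are chosen.

```python
import numpy as np


def text_latent(hidden_states):
    """Average text-token hidden states for one base task.

    Assumed layout: (num_steps, num_text_tokens, hidden_dim), where
    num_steps stacks every timestep of every demonstrated trajectory.
    Returns an array of shape (num_text_tokens, hidden_dim).
    """
    return np.asarray(hidden_states, dtype=float).mean(axis=0)


def interpolate_latents(latent_a, latent_b, step, total_steps):
    """Blend two text latents as a function of the episode timestep.

    Assumption: a linear schedule that starts at task A's latent and
    ends at task B's latent. The paper's actual schedule may differ.
    """
    alpha = step / max(total_steps - 1, 1)
    alpha = min(max(alpha, 0.0), 1.0)  # clamp to [0, 1]
    return (1.0 - alpha) * latent_a + alpha * latent_b


def inject(text_hidden, latent_a, latent_b, step, total_steps):
    """Add the interpolated latent back to the current text hidden states."""
    return text_hidden + interpolate_latents(latent_a, latent_b, step, total_steps)
```

With a schedule like this, early timesteps are steered mostly by the first base task (for example, picking up the cream cheese) and later timesteps mostly by the second (placing an object on top of the cabinet), which is one way to read "sub-behaviors from the two tasks will be activated".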
