TL;DR
This paper introduces Steerable Policies, a hierarchical control method that leverages pretrained vision-language models and synthetic commands to improve robot task generalization and controllability.
Contribution
Proposes grounding VLM knowledge in low-level policies by training VLAs on rich synthetic commands at several levels of abstraction (subtasks, motions, and grounded pixel coordinates), which improves task generalization and controllability.
Findings
Outperforms prior VLAs and hierarchical baselines in real-world manipulation tasks.
Can be steered either by a learned high-level reasoner or by prompting an off-the-shelf VLM.
Demonstrates improved generalization and long-horizon task performance.
Abstract
Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task…
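To make the interface described in the abstract more concrete, the following is a minimal sketch of how commands at different levels of abstraction might be represented and passed from a high-level reasoner to a low-level policy. It is not the authors' implementation: the names `Level`, `Command`, `render`, and `run_episode`, the command string format, and the stub reasoner and policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Level(Enum):
    """Hypothetical abstraction levels, following the examples in the abstract."""
    TASK = "task"
    SUBTASK = "subtask"
    MOTION = "motion"
    PIXEL = "pixel"


@dataclass(frozen=True)
class Command:
    """One steering command (assumed structure, not the paper's format)."""
    level: Level
    text: str
    pixel: Optional[Tuple[int, int]] = None  # (x, y) image coordinates, PIXEL only

    def render(self) -> str:
        """Turn the command into the string handed to the low-level policy."""
        if self.level is Level.PIXEL:
            if self.pixel is None:
                raise ValueError("pixel-level command needs coordinates")
            x, y = self.pixel
            return f"{self.text} at ({x}, {y})"
        return self.text


def run_episode(
    reasoner: Callable[[object], Command],
    policy: Callable[[object, str], List[float]],
    observation: object,
    max_steps: int = 3,
) -> List[Tuple[str, List[float]]]:
    """Query the reasoner for a command each step and let the policy act on it.

    The observation is held fixed here for simplicity; a real loop would
    update it from the environment after every action.
    """
    log = []
    for _ in range(max_steps):
        command = reasoner(observation)
        prompt = command.render()
        action = policy(observation, prompt)
        log.append((prompt, action))
    return log


# Stubs standing in for a VLM reasoner and a VLA policy (both hypothetical).
def stub_reasoner(observation: object) -> Command:
    return Command(Level.PIXEL, "move gripper", (120, 84))


def stub_policy(observation: object, prompt: str) -> List[float]:
    return [0.0] * 7  # placeholder 7-DoF action


if __name__ == "__main__":
    for prompt, action in run_episode(stub_reasoner, stub_policy, observation=None):
        print(prompt, action)
```

In this sketch the reasoner is free to choose the abstraction level at every step, which is one way to read the claim that richer commands give the VLM more ways to steer low-level behavior than a single task instruction would.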
