VLS: Steering Pretrained Robot Policies via Vision-Language Models
Shuo Liu, Ishneet Sukhvinder Singh, Yiqing Xu, Jiafei Duan, Ranjay Krishna

TL;DR
VLS is a training-free method that adapts pretrained robot policies at inference time using vision-language models to handle spatial and task variations without retraining.
Contribution
The paper introduces Vision-Language Steering (VLS), an inference-time control framework that guides frozen pretrained policies using reward functions synthesized by vision-language models.
Findings
VLS achieves a 31% improvement on the CALVIN benchmark.
VLS attains a 13% gain on LIBERO-PRO.
VLS demonstrates robust real-world adaptation on a Franka robot.
Abstract
Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution…
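To make the idea of steering a sampler concrete, here is a minimal toy sketch in Python. It is not the VLS algorithm: the abstract above is truncated and does not specify how the steering is carried out. The sketch shows one generic form of reward guidance for a flow-matching sampler, in which the frozen policy's predicted clean action is nudged along the gradient of a differentiable reward at every integration step. Everything named here is an assumption made for illustration: `predicted_action` is a stand-in for a pretrained policy, the quadratic reward and `goal` replace whatever reward a vision-language model would synthesize, and `guidance_scale` is a hypothetical tuning knob.

```python
import numpy as np


def predicted_action(a_t, t, nominal):
    """Toy stand-in for a frozen policy (assumption): it always predicts
    the same clean action, regardless of the noisy input a_t and time t."""
    return nominal


def reward_grad(a, goal):
    """Gradient of a hypothetical reward r(a) = -0.5 * ||a - goal||^2."""
    return goal - a


def sample(nominal, goal, guidance_scale=0.0, steps=50, seed=0):
    """Euler integration of a straight-line flow from noise to an action.

    At each step the predicted clean action is shifted along the reward
    gradient before the velocity is computed. The policy itself is never
    updated; only the sampling trajectory changes.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(nominal.shape)  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        a1_hat = predicted_action(a, t, nominal)
        # Steering step: move the prediction toward higher reward.
        a1_hat = a1_hat + guidance_scale * reward_grad(a1_hat, goal)
        velocity = (a1_hat - a) / (1.0 - t)
        a = a + dt * velocity
    return a


# Hypothetical 2-D action: the training-time target versus a shifted goal.
nominal = np.array([0.5, 0.0])
goal = np.array([0.5, 0.2])

unsteered = sample(nominal, goal, guidance_scale=0.0)
steered = sample(nominal, goal, guidance_scale=0.5)
```

With `guidance_scale=0.0` the sampler reproduces the nominal action that the frozen toy policy was "trained" to output. A positive scale moves the sampled action toward the goal without touching the policy, which is the sense in which adaptation happens at inference time only.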
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
