Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search
Cyrus Neary, Omar G. Younis, Artur Kuramshin, Ozgur Aslan, Glen Berseth

TL;DR
VLAPS integrates model-based search into the inference procedure of pre-trained vision-language-action (VLA) robot policies, improving task success rates and safety in out-of-distribution scenarios.
Contribution
Introduces VLAPS, a framework that embeds model-based search into VLA policy inference, enabling more robust and efficient robotic task execution.
Findings
VLAPS outperforms VLA-only baselines by up to 67 percentage points in task success rate.
The method enables better exploration of large, language-conditioned search spaces.
VLAPS improves safety and performance in zero-shot robotic tasks.
Abstract
Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviors or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning & Search (VLAPS) -- a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm -- run using a model of the target environment -- using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA…
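The abstract says the search is biased by action priors from the VLA policy but does not give the selection rule. One common way to inject a policy prior into MCTS is the PUCT rule from AlphaZero-style search, so the sketch below uses that rule as a stand-in; it should not be read as the modified MCTS the paper actually uses. Every name here (`Node`, `expand`, `puct_score`, `select_action`, `backup`, the constant `c_puct`, and the string action labels) is hypothetical, and the prior dictionary stands in for whatever distribution a VLA policy would produce over candidate actions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One search-tree node; `prior` is the policy probability of the action leading here."""
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def mean_value(self) -> float:
        # Unvisited nodes default to a neutral value estimate of 0.
        return self.value_sum / self.visits if self.visits else 0.0


def expand(node: Node, action_priors: dict) -> None:
    """Add one child per candidate action, normalizing the (assumed) policy priors."""
    total = sum(action_priors.values())
    if total <= 0:
        raise ValueError("action priors must have positive total mass")
    for action, p in action_priors.items():
        node.children[action] = Node(prior=p / total)


def puct_score(parent: Node, child: Node, c_puct: float = 1.5) -> float:
    """Mean value plus an exploration bonus scaled by the policy prior."""
    bonus = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.mean_value() + bonus


def select_action(parent: Node, c_puct: float = 1.5):
    """Pick the child action with the highest PUCT score."""
    return max(
        parent.children,
        key=lambda a: puct_score(parent, parent.children[a], c_puct),
    )


def backup(path: list, value: float) -> None:
    """Propagate a simulated return to every node on the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += value
```

In this sketch, a root expanded with priors that favor one action tries that action first. If the environment model then returns a poor outcome for it, `backup` lowers its mean value and the next call to `select_action` moves to a different action. This is the general mechanism by which a prior can focus search without overriding what the model reports.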
