Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search

Cyrus Neary; Omar G. Younis; Artur Kuramshin; Ozgur Aslan; Glen Berseth

arXiv:2508.12211 · cs.RO · November 14, 2025

TL;DR

VLAPS embeds model-based search into the inference procedure of pre-trained vision-language-action (VLA) policies, improving task success rates and safety when the policies are deployed zero-shot in out-of-distribution robotics scenarios.

Contribution

Introduces VLAPS, a framework that embeds model-based search into VLA policy inference, enabling more robust and efficient robotic task execution.

Findings

01. VLAPS outperforms VLA-only baselines by up to 67 percentage points in task success rate.

02. The method enables better exploration of large, language-conditioned search spaces.

03. VLAPS improves safety and performance in zero-shot robotic tasks.

Abstract

Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviors or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning & Search (VLAPS) -- a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm -- run using a model of the target environment -- using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA…
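
The abstract describes biasing MCTS with action priors defined by the VLA policy. One common way to use a policy prior in tree search is a PUCT-style selection rule, and the sketch below illustrates that idea only. It is not the modified MCTS from the paper, whose details are not given in this summary. The function names, the `c_puct` constant, and the example numbers are all hypothetical.

```python
import math


def puct_scores(priors, visit_counts, value_sums, c_puct=1.0):
    """Score each candidate action at a search node (hypothetical helper).

    priors:       prior probability per action, assumed to come from the VLA policy
    visit_counts: number of times each action has been tried from this node
    value_sums:   sum of returns backed up through each action
    c_puct:       exploration constant (an assumed hyperparameter)
    """
    total_visits = sum(visit_counts)
    scores = []
    for prior, visits, value_sum in zip(priors, visit_counts, value_sums):
        # Mean backed-up value; unvisited actions default to 0.
        q = value_sum / visits if visits > 0 else 0.0
        # Exploration bonus: large for high-prior, rarely visited actions.
        u = c_puct * prior * math.sqrt(total_visits + 1) / (1 + visits)
        scores.append(q + u)
    return scores


def select_action(priors, visit_counts, value_sums, c_puct=1.0):
    """Return the index of the action with the highest PUCT score."""
    scores = puct_scores(priors, visit_counts, value_sums, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative numbers only: three candidate actions with made-up priors.
example_priors = [0.7, 0.2, 0.1]
first_choice = select_action(example_priors, [0, 0, 0], [0.0, 0.0, 0.0])
later_choice = select_action(example_priors, [10, 0, 0], [0.0, 0.0, 0.0])
```

With no visits, the rule follows the prior and picks the highest-probability action. Once that action has been tried repeatedly without reward, the bonus shifts toward less-visited alternatives, which is how a prior can focus search without fully committing to the policy's first guess.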
