From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment

Yilin Wu; Ran Tian; Gokul Swamy; Andrea Bajcsy

arXiv:2502.01828·cs.RO·May 5, 2025

Open Access

TL;DR

FOREWARN introduces a novel framework that uses a latent world model to enable Vision Language Models to effectively verify and steer robotic policies by reasoning about future states in natural language, improving robustness and generalization.

Contribution

The paper proposes a decoupled approach that separates future state prediction from evaluation, enabling VLMs to serve as open-vocabulary verifiers for robot policy steering.

Findings

01. Effective policy steering across diverse tasks

02. Bridges representational gaps between latent states and VLM reasoning

03. Enhances robustness and generalization of robotic policies

Abstract

While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment-time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action…
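The steering loop described in the abstract (a generative policy proposes several candidate action plans, and an external verifier selects one) can be illustrated with a minimal sketch. This is not the FOREWARN implementation: `steer`, `predict_outcome`, `judge`, and the toy policy are hypothetical names introduced here for illustration only. The verifier is split into an outcome-prediction step and a judging step to mirror the decoupling of prediction from evaluation mentioned in the Contribution; in the paper those roles are filled by a latent world model and a VLM, whereas here they are stand-in arithmetic functions.

```python
import random
from typing import Callable, List

ActionPlan = List[float]  # toy stand-in for a sequence of low-level actions


def steer(
    sample_plan: Callable[[], ActionPlan],
    verifier: Callable[[ActionPlan], float],
    num_candidates: int = 8,
) -> ActionPlan:
    """Sample candidate plans from the policy and return the highest-scoring one."""
    candidates = [sample_plan() for _ in range(num_candidates)]
    scores = [verifier(plan) for plan in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# --- Hypothetical stand-ins (assumptions, not from the paper) ---

def predict_outcome(plan: ActionPlan) -> float:
    """Toy 'foresight': predict where the plan ends up (sum of displacements)."""
    return sum(plan)


def judge(outcome: float, goal: float = 1.0) -> float:
    """Toy 'forethought': score a predicted outcome; closer to the goal is better."""
    return -abs(outcome - goal)


def toy_verifier(plan: ActionPlan) -> float:
    """Verifier that first predicts the outcome, then evaluates it."""
    return judge(predict_outcome(plan))


if __name__ == "__main__":
    rng = random.Random(0)
    toy_policy = lambda: [rng.uniform(-1.0, 1.0) for _ in range(3)]
    chosen = steer(toy_policy, toy_verifier, num_candidates=16)
    print("chosen plan:", chosen, "score:", toy_verifier(chosen))
```

Keeping the verifier as a plain callable means the selection loop does not depend on how outcomes are predicted or judged, so either step could be swapped for a learned model without changing `steer`.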

Taxonomy

Topics: Simulation Techniques and Applications

Methods: ALIGN