VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

Mateo Guaman Castro; Sidharth Rajagopal; Daniel Gorbatov; Matt Schmittle; Rohan Baijal; Octi Zhang; Rosario Scalise; Sidharth Talia; Emma Romig; Celso de Melo; Byron Boots; Abhishek Gupta

arXiv:2510.20818 · cs.RO · October 24, 2025

TL;DR

VAMOS introduces a hierarchical vision-language-action model that separates semantic planning from embodiment constraints, enabling versatile, safe, and steerable robot navigation across diverse environments and robot types.

Contribution

The paper presents a hierarchical vision-language-action (VLA) model that decouples semantic planning from embodiment grounding, allowing cross-embodied navigation and improved safety in real-world robot navigation.

Findings

01 Higher success rates than state-of-the-art baselines in both indoor and complex outdoor navigation.

02 Effective cross-embodied navigation across different robot types.

03 3X higher success rates when the affordance model rejects infeasible plans.

Abstract

A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enable this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space, which the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and…
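
The propose-then-re-rank interface described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how paths are scored, so the per-pixel affordance grid, the min-over-waypoints score, the fixed rejection threshold, and every name below (`path_score`, `rerank_paths`, `threshold`) are assumptions made for illustration only.

```python
from typing import List, Tuple

# A candidate path is a list of (row, col) pixel waypoints in image space.
Path = List[Tuple[int, int]]
# Hypothetical affordance map: one traversability value in [0, 1] per pixel,
# standing in for the output of an embodiment-specific affordance model.
AffordanceMap = List[List[float]]


def path_score(path: Path, affordance: AffordanceMap) -> float:
    """Score a path by its least traversable waypoint (an assumed rule)."""
    if not path:
        return 0.0
    return min(affordance[r][c] for r, c in path)


def rerank_paths(
    paths: List[Path], affordance: AffordanceMap, threshold: float = 0.5
) -> List[Tuple[float, Path]]:
    """Drop paths scoring below `threshold`, then sort the rest best-first."""
    scored = [(path_score(p, affordance), p) for p in paths]
    feasible = [sp for sp in scored if sp[0] >= threshold]
    feasible.sort(key=lambda sp: sp[0], reverse=True)
    return feasible


# Toy 3x3 "image": the right-hand column is nearly untraversable.
affordance = [
    [0.9, 0.9, 0.1],
    [0.9, 0.8, 0.1],
    [0.9, 0.7, 0.1],
]
left = [(2, 0), (1, 0), (0, 0)]
center = [(2, 1), (1, 1), (0, 1)]
right = [(2, 2), (1, 2), (0, 2)]

# Candidates as a planner might propose them, in arbitrary order.
ranked = rerank_paths([right, center, left], affordance)
```

In this toy case `right` is rejected outright and `left` is ranked ahead of `center`. Swapping in a different affordance map for a different robot changes the ranking without touching the planner's proposals, which is the decoupling the abstract describes.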

Taxonomy

Topics: Social Robot Interaction and HRI · Robot Manipulation and Learning · Reinforcement Learning in Robotics