LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments

Hongyu Ding; Ziming Xu; Yudong Fang; You Wu; Zixuan Chen; Jieqi Shi; Jing Huo; Yifan Zhang; Yang Gao

arXiv:2510.19655 · cs.RO · March 5, 2026

Open Access

TL;DR

LaViRA introduces a hierarchical, multimodal large language model-based framework for zero-shot vision-language navigation in continuous environments, significantly improving generalization and reasoning without environment-specific training.

Contribution

It proposes a modular, hierarchical approach leveraging different scales of multimodal large language models for reasoning, grounding, and control in zero-shot navigation tasks.

Findings

01 Outperforms state-of-the-art on VLN-CE benchmark

02 Demonstrates superior generalization to unseen environments

03 Maintains transparency and efficiency for real-world use

Abstract

Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an agent to navigate unseen environments based on natural language instructions without any prior training. Current methods face a critical trade-off: either rely on environment-specific waypoint predictors that limit scene generalization, or underutilize the reasoning capabilities of large models during navigation. We introduce LaViRA, a simple yet effective zero-shot framework that addresses this dilemma by decomposing action into a coarse-to-fine hierarchy: Language Action for high-level planning, Vision Action for middle-level perceptual grounding, and Robot Action for low-level control. This modular decomposition allows us to leverage the distinct strengths of different scales of Multimodal Large Language Models (MLLMs) at each stage, creating a system that is powerful in its reasoning,…
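
The coarse-to-fine hierarchy described in the abstract can be pictured as three stages chained in sequence. The sketch below is only an illustration of that structure, not the authors' implementation: every name in it (`LanguageAction`, `VisionAction`, `RobotAction`, `navigate_step`, the stub planner and grounder, and the pixel-to-angle controller with its field-of-view and step-size defaults) is a hypothetical stand-in, and the two stubs take the place of the MLLM calls, whose prompts and interfaces the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LanguageAction:
    subgoal: str  # high-level plan step, expressed in natural language


@dataclass
class VisionAction:
    pixel: Tuple[int, int]  # (x, y) target location in the current camera image


@dataclass
class RobotAction:
    turn_deg: float   # positive values turn right (a convention assumed here)
    forward_m: float  # distance to move forward, in metres


# Hypothetical stage signatures; an observation is a plain string placeholder.
Planner = Callable[[str, str], LanguageAction]
Grounder = Callable[[LanguageAction, str], VisionAction]
Controller = Callable[[VisionAction], RobotAction]


def navigate_step(instruction: str, observation: str, planner: Planner,
                  grounder: Grounder, controller: Controller) -> RobotAction:
    """Run one pass through the language -> vision -> robot hierarchy."""
    plan = planner(instruction, observation)   # Language Action: planning
    target = grounder(plan, observation)       # Vision Action: grounding
    return controller(target)                  # Robot Action: control


def stub_planner(instruction: str, observation: str) -> LanguageAction:
    # Stand-in for an MLLM planner: take the first clause as the sub-goal.
    return LanguageAction(subgoal=instruction.split(",")[0].strip())


def stub_grounder(plan: LanguageAction, observation: str) -> VisionAction:
    # Stand-in for an MLLM grounder: always return a fixed pixel.
    return VisionAction(pixel=(480, 240))


def pixel_controller(target: VisionAction, image_width: int = 640,
                     hfov_deg: float = 90.0, step_m: float = 0.25) -> RobotAction:
    # Assumed linear mapping from horizontal pixel offset to a turn angle.
    x, _ = target.pixel
    turn = (x - image_width / 2) / image_width * hfov_deg
    return RobotAction(turn_deg=turn, forward_m=step_m)


action = navigate_step("walk to the door, then turn left", "<rgb frame>",
                       stub_planner, stub_grounder, pixel_controller)
print(action)
```

Because each stage is just a callable with a fixed input and output type, a stage can be swapped (for example, a different model behind the planner) without touching the other two, which is one way to read the modularity the abstract emphasizes.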

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning