What Limits Vision-and-Language Navigation?

Yunheng Wang; Yuetong Fang; Taowen Wang; Lusong Li; Kun Liu; Junzhe Xu; Zizhao Yuan; Yixiao Feng; Jiaxi Zhang; Wei Lu; Zecui Zeng; Renjing Xu

arXiv:2605.13328 · cs.RO · May 14, 2026

TL;DR

StereoNav enhances real-world vision-and-language navigation by using stereo vision and target-location priors to improve robustness, grounding, and performance across domains.

Contribution

The paper introduces StereoNav, a novel framework that leverages stereo vision and target-location priors to improve cross-domain robustness in VLN tasks.

Findings

1. StereoNav achieves state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.
2. StereoNav significantly improves real-world navigation reliability in robotic deployments.
3. StereoNav uses fewer parameters and less training data than prior approaches that rely on scaling up model size and training data.

Abstract

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively…

Code & Models

Repositories

https://yunheng-wang.github.io/stereonav-public.github.io