TL;DR
StereoNav enhances real-world vision-and-language navigation by using stereo vision and target-location priors to improve robustness, grounding, and performance across domains.
Contribution
The paper introduces StereoNav, a novel framework that leverages stereo vision and target-location priors to improve cross-domain robustness in VLN tasks.
Findings
StereoNav achieves state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.
StereoNav significantly improves real-world navigation reliability in robotic deployments.
StereoNav uses fewer parameters and less training data than prior approaches that rely on scaling up model size and training data.
Abstract
Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively…
