View Invariant Learning for Vision-Language Navigation in Continuous Environments

Josh Qixuan Sun; Huaiyuan Weng; Xiaoying Xing; Chul Min Yeum; Mark Crowley

arXiv:2507.08831 · cs.CV · February 23, 2026

TL;DR

This paper introduces VIL, a view-invariant post-training framework for vision-language navigation in continuous environments. It makes existing navigation policies robust to changes in camera height and viewing angle, and outperforms state-of-the-art approaches by 8-15% on the V$^{2}$-VLNCE benchmarks.

Contribution

The paper proposes a novel view-invariant post-training framework using contrastive learning and teacher-student models to improve navigation robustness to viewpoint variations.
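As a rough illustration of the teacher-student idea, the sketch below computes a distillation loss between two sets of scores over candidate waypoint bins. The summary does not state which loss the paper uses, so the KL divergence here is an assumption, and the names `softmax`, `distillation_kl`, `teacher_logits`, and `student_logits` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits):
    """KL(teacher || student) over candidate waypoint bins.

    Assumption: both models score the same discrete set of waypoint
    candidates; the paper's actual objective may differ.
    """
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return sum(ti * (math.log(ti) - math.log(si))
               for ti, si in zip(t, s) if ti > 0.0)

# Hypothetical scores: the teacher sees the standard viewpoint,
# the student sees a varied viewpoint of the same scene.
teacher_logits = [2.0, 0.5, -1.0]
student_matching = [2.0, 0.5, -1.0]
student_mismatched = [-1.0, 0.5, 2.0]

loss_matching = distillation_kl(teacher_logits, student_matching)
loss_mismatched = distillation_kl(teacher_logits, student_mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal a student would be trained to minimize under this assumption.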

Findings

1. Outperforms state-of-the-art approaches on the V$^{2}$-VLNCE benchmarks by 8-15%.

2. Improves performance in the standard VLNCE setting as well, despite being trained for varied viewpoints.

3. Achieves state-of-the-art results on the RxR-CE dataset and demonstrates real-robot applicability.

Abstract

Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle. Here we introduce a more general scenario, V$^{2}$-VLNCE (VLNCE with Varied Viewpoints), and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training…
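To make the contrastive component concrete, here is a minimal sketch of an InfoNCE-style loss that pulls together features of the same observation seen from two viewpoints. The abstract does not specify the loss, so InfoNCE is an assumption, and `info_nce`, `l2_normalize`, the `temperature` value, and the toy feature vectors are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchor_feats, positive_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row i of anchor_feats (standard viewpoint) is treated as the positive
    pair of row i of positive_feats (varied viewpoint); every other row
    in the batch acts as a negative.
    """
    anchors = [l2_normalize(v) for v in anchor_feats]
    positives = [l2_normalize(v) for v in positive_feats]
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy features: two observations, each seen from two viewpoints.
standard_view = [[1.0, 0.0], [0.0, 1.0]]
varied_view_aligned = [[0.9, 0.1], [0.1, 0.9]]   # similar across views
varied_view_swapped = [[0.1, 0.9], [0.9, 0.1]]   # views disagree

loss_aligned = info_nce(standard_view, varied_view_aligned)
loss_swapped = info_nce(standard_view, varied_view_swapped)
```

In this toy setup the loss is small when features agree across viewpoints and large when they do not, so minimizing it encourages view-invariant features under the stated assumptions.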
