View Invariant Learning for Vision-Language Navigation in Continuous Environments
Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, Mark Crowley

TL;DR
This paper introduces VIL, a view-invariant post-training framework that makes vision-language navigation policies in continuous environments robust to changes in camera height and viewing angle, improving performance under both varied and standard viewpoints.
Contribution
The paper proposes a novel view-invariant post-training framework using contrastive learning and teacher-student models to improve navigation robustness to viewpoint variations.
Findings
Outperforms state-of-the-art on V2-VLNCE benchmarks by 8-15%.
Improves performance in standard VLNCE settings despite training for varied viewpoints.
Achieves state-of-the-art results on the RxR-CE dataset and demonstrates real-robot applicability.
Abstract
Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle. Here we introduce a more general scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training…
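The abstract names two ingredients, a contrastive objective over viewpoints and teacher-student distillation, without giving their exact form. The sketch below is one plausible reading rather than the authors' implementation. It assumes an InfoNCE-style loss (a standard contrastive loss, swapped in here for illustration) in which features of the same scene under a standard and a varied camera form the positive pair, and a mean-squared-error distillation term. The names `info_nce`, `distillation_loss`, and `temperature`, and all the toy feature values, are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def info_nce(standard_feats, varied_feats, temperature=0.1):
    """Mean InfoNCE loss (assumed form, not taken from the paper).

    Entry i of each list is assumed to come from the same scene, so
    (standard_feats[i], varied_feats[i]) is the positive pair and every
    other varied feature in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(standard_feats):
        logits = [cosine(anchor, v) / temperature for v in varied_feats]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def distillation_loss(teacher_out, student_out):
    """Mean squared error between teacher and student outputs (assumed form)."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_out, student_out)]
    return sum(squared) / len(squared)


# Toy features: two scenes seen from a standard camera and a varied camera.
standard = [[1.0, 0.0], [0.0, 1.0]]
varied_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same scene order as `standard`
varied_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # scene order swapped

loss_aligned = info_nce(standard, varied_aligned)
loss_shuffled = info_nce(standard, varied_shuffled)
```

In this toy setup the loss is lower when each varied-view feature sits close to the standard-view feature of the same scene, which is the behaviour a view-invariant encoder is trained toward. The distillation term would be added on top, pulling the student's waypoint predictions toward those of the view-dependent teacher.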
