CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations
Jialu Li, Hao Tan, Mohit Bansal

TL;DR
This paper introduces CLEAR, a method that enhances vision-language navigation by developing cross-lingual, environment-agnostic representations, enabling better generalization across languages and unseen environments.
Contribution
The paper proposes learning two representations for VLN tasks: a shared, visually aligned cross-lingual language representation and an environment-agnostic visual representation, with the aim of improving generalization and transferability.
Findings
Significant performance improvements on the Room-Across-Room dataset.
Effective transfer of learned representations to other VLN tasks.
Enhanced generalization to unseen environments and languages.
Abstract
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on…
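The abstract describes learning a representation by maximizing the similarity between aligned pairs. A minimal sketch of that general idea follows, assuming cosine similarity as the measure and a loss of one minus the mean pairwise similarity. The function names `cosine_similarity` and `alignment_loss` are hypothetical, and this is not the paper's actual objective, whose details (including the constraints mentioned above) are not given here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def alignment_loss(pairs: List[Tuple[Vector, Vector]]) -> float:
    """Hypothetical loss: 1 minus the mean cosine similarity over aligned pairs.

    Minimizing this value is equivalent to maximizing the average similarity
    between the two embeddings in each pair (an assumption for illustration,
    not the objective used in the paper).
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    total = sum(1.0 - cosine_similarity(a, b) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy embeddings standing in for features of two aligned images.
    aligned = [([1.0, 0.0], [2.0, 0.0])]      # same direction
    misaligned = [([1.0, 0.0], [0.0, 1.0])]   # orthogonal
    print(alignment_loss(aligned))
    print(alignment_loss(misaligned))
```

In this sketch the loss is lowest when the paired embeddings point in the same direction and grows as they diverge, so a model trained against it would be pushed to map aligned inputs to similar features. The same shape of objective could apply to text pairs in different languages, again only as an illustration.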
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
