LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation
Nitesh Subedi, Adam Haroon, Samuel Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar

TL;DR
LCLA is a modular framework for vision-language navigation that aligns sensory observations to the latent space of a frozen expert policy. Decoupling perception from control in this way is intended to give robust zero-shot generalization and lightweight inference.
Contribution
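The paper recasts visuomotor learning for vision-language navigation as supervised latent alignment: a lightweight adapter maps features from a frozen vision-language model into the latent space of a frozen expert trained on privileged state, instead of optimizing a policy end to end. The authors present this stable perception-control interface as improving generalization and modularity over end-to-end methods.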
Findings
Strong in-distribution navigation performance
Robust zero-shot generalization to unseen environments
Lightweight inference with a modular perception-action interface
Abstract
We propose LCLA (Language-Conditioned Latent Alignment), a framework for vision-language navigation that learns modular perception-action interfaces by aligning sensory observations to a latent representation of an expert policy. The expert is first trained with privileged state information, inducing a latent space sufficient for control, after which its latent interface and action head are frozen. A lightweight adapter is then trained to map raw visual-language observations, via a frozen vision-language model, into the expert's latent space, reducing the problem of visuomotor learning to supervised latent alignment rather than end-to-end policy optimization. This decoupling enforces a stable contract between perception and control, enabling expert behavior to be reused across sensing modalities and environmental variations. We instantiate LCLA and evaluate it on a vision-language…
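The two-stage recipe in the abstract (freeze an expert's latent interface and action head, then fit an adapter by supervised regression onto the expert's latents) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the dimensions, the random linear "expert", the random featurizer standing in for a frozen vision-language model, the linear adapter, and the mean-squared-error objective are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, FEAT_DIM, LATENT_DIM, ACTION_DIM = 4, 16, 8, 2

# Stage 1 stand-in: an "expert" that maps privileged state to a latent
# and the latent to an action. A real expert would be trained first;
# here random weights play that role and are never updated (frozen).
W_enc = rng.normal(size=(STATE_DIM, LATENT_DIM))
W_act = rng.normal(size=(LATENT_DIM, ACTION_DIM))

def expert_latent(state):
    """Frozen expert encoder: privileged state -> latent."""
    return np.tanh(state @ W_enc)

def action_head(latent):
    """Frozen expert action head: latent -> action."""
    return latent @ W_act

# Stand-in for a frozen vision-language model: a fixed random featurizer.
# In this toy, "observations" are derived from the same underlying state.
W_vlm = rng.normal(size=(STATE_DIM, FEAT_DIM))

def frozen_vlm_features(state):
    return np.tanh(state @ W_vlm)

def train_adapter(states, steps=500, lr=0.1):
    """Stage 2: fit a linear adapter by gradient descent on the MSE
    between adapter(features) and the frozen expert's latent."""
    feats = frozen_vlm_features(states)   # adapter inputs
    targets = expert_latent(states)       # supervised alignment targets
    A = np.zeros((FEAT_DIM, LATENT_DIM))  # the only trainable parameters
    losses = []
    for _ in range(steps):
        err = feats @ A - targets
        losses.append(float(np.mean(err ** 2)))
        A -= lr * 2.0 * feats.T @ err / err.size  # gradient of the mean squared error
    return A, losses

states = rng.normal(size=(256, STATE_DIM))
A, losses = train_adapter(states)

def policy(state):
    """Inference path: features -> adapter -> frozen action head."""
    return action_head(frozen_vlm_features(state) @ A)

action = policy(rng.normal(size=(1, STATE_DIM)))
```

In this sketch only `A` is ever updated, which mirrors the "stable contract" idea: the expert's latent space and action head stay fixed, so a different sensing modality would, under these assumptions, only need a new adapter.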
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
