LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

Haihong Hao; Lei Chen; Mingfei Han; Changlin Li; Dong An; Yuqiang Yang; Zhihui Li; Xiaojun Chang

arXiv:2603.29165·cs.CV·April 1, 2026

TL;DR

LatentPilot is a vision-and-language navigation approach that learns future visual dynamics during training, so the agent can anticipate the visual effect of its actions without access to future frames at inference.

Contribution

The paper proposes a training paradigm that learns action-conditioned visual dynamics and latent tokens, allowing the agent to anticipate future observations and improve navigation performance.
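As a rough illustration of what "action-conditioned dynamics learned from future observations" can mean, the sketch below trains a one-parameter model to predict the next observation from the current observation and an action. The future observation serves only as a training target; at inference the model imagines the outcome of each action without reading any future frame. Everything here (the 1-D world, `TRUE_STEP`, `DynamicsModel`, `choose_action`) is a hypothetical toy for illustration, not LatentPilot's architecture or training objective.

```python
import random

# Hypothetical toy world: the observation is a single number (a position),
# and an action in {-1, 0, +1} shifts it by a fixed step unknown to the model.
TRUE_STEP = 0.5

def observe_next(obs, action):
    """Ground-truth future observation; used during training only."""
    return obs + TRUE_STEP * action

class DynamicsModel:
    """Predicts the next observation as obs + w * action, with one learned weight w."""
    def __init__(self):
        self.w = 0.0

    def predict(self, obs, action):
        return obs + self.w * action

    def train_step(self, obs, action, future_obs, lr=0.1):
        # Gradient step on the squared error against the real future observation.
        error = self.predict(obs, action) - future_obs
        self.w -= lr * 2.0 * error * action
        return error ** 2

def train(model, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        obs = rng.uniform(-1.0, 1.0)
        action = rng.choice([-1, 0, 1])
        model.train_step(obs, action, observe_next(obs, action))
    return model

def choose_action(model, obs, goal):
    """Inference: rank actions by the imagined next observation; no future frame is read."""
    return min([-1, 0, 1], key=lambda a: abs(model.predict(obs, a) - goal))

model = train(DynamicsModel())
print(round(model.w, 3), choose_action(model, 0.0, 1.0))
```

The design point the toy tries to capture is the asymmetry between the two phases: `train_step` consumes `future_obs`, while `choose_action` relies only on the model's own prediction.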

Findings

01. Achieves new state-of-the-art results on R2R-CE, RxR-CE, and R2R-PE benchmarks.

02. Demonstrates superior environment-action understanding in real-robot tests.

03. Employs a flywheel-style training mechanism with on-policy trajectories.

Abstract

Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better…
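The flywheel idea in the abstract (collect on-policy trajectories, retrain, repeat) can be sketched with a toy loop. The version below is an assumption-laden illustration, not the paper's procedure: the 1-D corridor, `TablePolicy`, and `expert_action` are hypothetical, and the relabeling step (an oracle labels the states the current policy visits) is borrowed from DAgger-style imitation learning rather than taken from LatentPilot.

```python
GOAL, LENGTH = 7, 10  # hypothetical 1-D corridor with cells 0..9

def expert_action(state):
    """Oracle label used for training: step toward the goal (-1, 0, or +1)."""
    return (GOAL > state) - (GOAL < state)

class TablePolicy:
    """Maps each cell to an action; cells never seen in training default to 'stay' (0)."""
    def __init__(self):
        self.table = {}

    def act(self, state):
        return self.table.get(state, 0)

    def fit(self, dataset):
        # "Retraining" here is simply memorising the label for each visited cell.
        for state, action in dataset:
            self.table[state] = action

def rollout(policy, start=0, horizon=12):
    """Run the current policy and return the states it visits (on-policy data)."""
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        state = min(LENGTH - 1, max(0, state + policy.act(state)))
    return visited

def flywheel(rounds=30):
    policy, dataset, frontier = TablePolicy(), [], []
    for _ in range(rounds):
        states = rollout(policy)                               # 1. collect with current policy
        frontier.append(max(states))                           #    furthest cell reached
        dataset.extend((s, expert_action(s)) for s in states)  # 2. label visited states
        policy.fit(dataset)                                    # 3. retrain on all data so far
    return policy, frontier

def reaches_goal(policy, start=0, horizon=12):
    state = start
    for _ in range(horizon):
        if state == GOAL:
            return True
        state = min(LENGTH - 1, max(0, state + policy.act(state)))
    return state == GOAL

policy, frontier = flywheel()
print(frontier[:9], reaches_goal(policy))
```

In this toy, each round's rollout reaches one cell further than the last, so the training data always covers the states the current policy actually encounters. That coupling between data collection and the current policy is the property the flywheel loop is meant to illustrate.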

Code & Models

Repositories

https://abdd.top/latentpilot
github
