TL;DR
LatentPilot is a vision-and-language navigation (VLN) approach that learns future visual dynamics during training, so the agent can make better decisions without access to future frames at inference.
Contribution
It proposes a training paradigm that learns action-conditioned visual dynamics and latent tokens, allowing the agent to anticipate future observations and improve navigation performance.
Findings
Achieves new state-of-the-art results on R2R-CE, RxR-CE, and R2R-PE benchmarks.
Demonstrates superior environment-action understanding in real-robot tests.
Employs a flywheel-style training mechanism with on-policy trajectories.
Abstract
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can imagine the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices. Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better…
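The collect-and-retrain loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D corridor environment, the tabular "dynamics model", and every name below (`step`, `rollout`, `fit_dynamics`, `make_policy`, `flywheel`) are hypothetical stand-ins chosen only to show the shape of the idea. Future observations are recorded as training targets during rollouts, while the policy at decision time sees only the current observation plus what the learned dynamics predict.

```python
import random
from collections import defaultdict

# Toy 1-D corridor (an assumption for illustration, not the paper's setup):
# the observation is the agent's position; actions move it by -1 or +1.
ACTIONS = (-1, +1)
GOAL = 5

def step(obs, action):
    """Hypothetical environment transition: clamp the position to [0, GOAL]."""
    return max(0, min(GOAL, obs + action))

def rollout(policy, max_steps=20):
    """Collect one on-policy trajectory of (obs, action, next_obs) triples."""
    obs, traj = 0, []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs = step(obs, action)
        # next_obs (the "future frame") is stored only as a training target.
        traj.append((obs, action, next_obs))
        obs = next_obs
        if obs == GOAL:
            break
    return traj

def fit_dynamics(dataset):
    """'Retrain': tabulate the most frequent next observation per (obs, action)."""
    counts = defaultdict(lambda: defaultdict(int))
    for obs, action, next_obs in dataset:
        counts[(obs, action)][next_obs] += 1
    return {key: max(nexts, key=nexts.get) for key, nexts in counts.items()}

def make_policy(dynamics, rng, epsilon=0.2):
    """Pick the action whose predicted next observation is closest to the goal."""
    def policy(obs):
        if rng.random() < epsilon:
            return rng.choice(ACTIONS)  # occasional exploration
        # Unseen (obs, action) pairs fall back to predicting "no change".
        return min(ACTIONS, key=lambda a: abs(GOAL - dynamics.get((obs, a), obs)))
    return policy

def flywheel(rounds=3, episodes=10, seed=0):
    """Alternate on-policy data collection and retraining for a few rounds."""
    rng = random.Random(seed)
    dataset, dynamics = [], {}
    for _ in range(rounds):
        policy = make_policy(dynamics, rng)      # policy built from the current model
        for _ in range(episodes):
            dataset.extend(rollout(policy))      # collect on-policy trajectories
        dynamics = fit_dynamics(dataset)         # retrain on everything collected
    return dynamics, dataset
```

Calling `flywheel()` returns the learned transition table and the accumulated dataset. In this sketch each round's policy is built from the previous round's dynamics, so later trajectories come from a better-informed policy; a real system would replace the table with a learned model over visual observations.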
