TL;DR
This paper introduces Dual-Anchoring, a framework that improves vision-language navigation by explicitly addressing progress and memory drift, leading to significant performance gains in complex environments.
Contribution
The paper proposes a novel Dual-Anchoring Framework with instruction progress and memory landmark anchoring, along with large datasets for training and evaluation.
Findings
Achieved a 15.2% improvement in Success Rate.
Gained 24.7% on long-horizon trajectories.
Demonstrated effectiveness in both simulation and real-world environments.
Abstract
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have substantially advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, causing it to lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we…
