TL;DR
SpaAct introduces spatial activation tasks and curriculum learning to enhance vision-language models for navigation, achieving state-of-the-art results in unseen environments.
Contribution
The paper proposes a training framework that combines spatial activation tasks with a progressive curriculum to improve dynamic spatial awareness in VLN models.
Findings
SpaAct improves navigation performance on VLN-CE benchmarks.
The framework enhances backward action reasoning and forward transition prediction.
State-of-the-art results are achieved with VLM-based navigation.
Abstract
Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring such awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates the dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward…
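To make the two spatial activation tasks more concrete, the following is a minimal sketch of how training samples for each might be assembled from a recorded trajectory. It is not the authors' implementation: the `Step` record, both function names, the dictionary layout, and the choice to draw distractor frames from the same trajectory are all assumptions made for illustration, since the abstract does not specify the data format.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """Hypothetical trajectory record (not from the paper)."""
    frame_id: str  # identifier of the observation at this step
    action: str    # action executed after observing this frame


def action_retrospection_sample(traj: List[Step], start: int, length: int) -> Dict:
    """Backward reasoning: show consecutive frames, ask for the actions between them."""
    window = traj[start:start + length + 1]
    return {
        "frames": [s.frame_id for s in window],
        # One action per transition, so one fewer action than frames.
        "target_actions": [s.action for s in window[:-1]],
    }


def future_frame_selection_sample(
    traj: List[Step], t: int, num_candidates: int, rng: random.Random
) -> Dict:
    """Forward prediction: given history and the action at step t, pick the next frame."""
    correct = traj[t + 1].frame_id
    # Assumption: distractors are other frames from the same trajectory.
    pool = [s.frame_id for s in traj if s.frame_id != correct]
    candidates = rng.sample(pool, num_candidates - 1) + [correct]
    rng.shuffle(candidates)
    return {
        "history": [s.frame_id for s in traj[: t + 1]],
        "action": traj[t].action,
        "candidates": candidates,
        "answer": candidates.index(correct),
    }


# Toy trajectory with made-up frame ids and actions.
demo_traj = [
    Step("f0", "forward"),
    Step("f1", "turn_left"),
    Step("f2", "forward"),
    Step("f3", "turn_right"),
    Step("f4", "stop"),
]
retro = action_retrospection_sample(demo_traj, start=1, length=2)
future = future_frame_selection_sample(demo_traj, t=2, num_candidates=3, rng=random.Random(0))
```

In this sketch both sample types come from the same trajectory data without extra annotation, which is one way to read the abstract's description of the supervision as lightweight. How the samples are turned into prompts and mixed with navigation data under the curriculum is not covered here.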
