SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

Pengna Li; Kangyi Wu; Shaoqing Xu; Fang Li; Hanbing Li; Lin Zhao; Kailin Lyu; Long Chen; Zhi-Xin Yang; Nanning Zheng

arXiv:2604.27620 · cs.CV · May 1, 2026

TL;DR

SpaAct introduces spatial activation tasks and curriculum learning to enhance vision-language models for navigation, achieving state-of-the-art results in unseen environments.

Contribution

It proposes a novel training framework with spatial activation tasks and a progressive curriculum to improve dynamic spatial awareness in VLN models.

Findings

01 SpaAct improves navigation performance on VLN-CE benchmarks.

02 The framework enhances backward action reasoning and forward transition prediction.

03 State-of-the-art results are achieved with VLM-based navigation.

Abstract

Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting vision-language models (VLMs) to VLN requires endowing them with two complementary capabilities for acquiring dynamic spatial awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which forces the model to predict the visual transitions conditioned on history and action. These two objectives provide lightweight supervision on both backward action reasoning and forward…
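
To make the two spatial activation tasks concrete, here is a minimal sketch of how training samples for them might be built from a recorded trajectory. It is an illustration under stated assumptions, not the authors' data pipeline: the `Step` record, both function names, the dictionary keys, and the multiple-choice framing of Future Frame Selection (one true next frame mixed with distractors drawn from the same trajectory) are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Step:
    """One trajectory step (hypothetical schema, not from the paper)."""
    frame_id: str  # observation seen before acting
    action: str    # action executed from that observation


def action_retrospection_sample(traj: Sequence[Step], start: int, length: int) -> Dict:
    """Backward action reasoning: show consecutive frames, ask for the actions between them."""
    window = traj[start:start + length + 1]          # length + 1 frames
    frames = [s.frame_id for s in window]
    target_actions = [s.action for s in window[:-1]]  # length actions linking the frames
    return {"frames": frames, "target_actions": target_actions}


def future_frame_selection_sample(
    traj: Sequence[Step], t: int, num_distractors: int, rng: random.Random
) -> Dict:
    """Forward transition prediction: given history and action, pick the true next frame."""
    history = [s.frame_id for s in traj[: t + 1]]
    action = traj[t].action
    correct = traj[t + 1].frame_id
    # Assumption: distractors are other frames from the same trajectory.
    pool: List[str] = [s.frame_id for s in traj if s.frame_id != correct]
    candidates = rng.sample(pool, num_distractors) + [correct]
    rng.shuffle(candidates)
    return {
        "history": history,
        "action": action,
        "candidates": candidates,
        "answer_index": candidates.index(correct),
    }


# Toy trajectory with made-up frame ids and action names.
traj = [
    Step("f0", "forward"),
    Step("f1", "turn_left"),
    Step("f2", "forward"),
    Step("f3", "turn_right"),
    Step("f4", "stop"),
]
ar = action_retrospection_sample(traj, start=1, length=2)
ffs = future_frame_selection_sample(traj, t=2, num_distractors=2, rng=random.Random(0))
```

In this sketch the Action Retrospection sample pairs three frames with the two actions that connect them, while the Future Frame Selection sample hides the true next frame among distractors, so both objectives can be supervised from trajectories alone without extra annotation.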
