Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation

Bolei Chen; Jiaxu Kang; Yifei Wang; Ping Zhong; Qi Wu; Jianxin Wang

arXiv:2507.21450 · cs.CV · July 30, 2025

TL;DR

This paper proposes a navigation policy for Vision Language Navigation (VLN) that recursively summarizes the visual observations gathered along the way and adaptively aligns them with linguistic commands, with the goal of more accurate navigation in complex scenes.

Contribution

The paper contributes Recursive Visual Imagination (RVI) techniques, which build on modeling historical trajectories as compact neural grids, together with an adaptive linguistic grounding technique that aligns the summarized visual perceptions with commands in VLN tasks.

Findings

01 Outperforms state-of-the-art methods on the VLN-CE and ObjectNav benchmarks.

02 Improves scene representation by focusing on semantic layouts rather than overly detailed scene content.

03 Enhances command comprehension through fine-grained semantic matching between visual perceptions and commands.

Abstract

Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques…
