VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

Shaoan Wang; Yuanfei Luo; Xingyu Chen; Aocheng Luo; Dongyue Li; Chang Liu; Sheng Chen; Yangang Zhang; Junzhi Yu

arXiv:2601.08665·cs.RO·January 14, 2026

Open Access

TL;DR

VLingNav introduces an adaptive reasoning and persistent memory framework for embodied navigation, significantly improving performance and generalization in complex, long-horizon tasks and enabling zero-shot transfer to real robots.

Contribution

The paper presents VLingNav, a novel embodied navigation model that combines an adaptive chain-of-thought reasoning mechanism with a visual-assisted linguistic memory, supported by a large reasoning-annotated dataset and reinforcement learning training.

Findings

01. Achieves state-of-the-art results on multiple navigation benchmarks.

02. Transfers zero-shot to real-world robotic platforms.

03. Demonstrates improved reasoning and memory capabilities in navigation tasks.

Abstract

Vision-Language-Action (VLA) models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large Vision-Language Models (VLMs). However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory…
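
The fast/slow switching described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not VLingNav's implementation: the names `AdaptiveAgent`, `needs_reasoning`, `fast_policy`, `slow_planner`, and `make_toy_agent` are hypothetical, and the summary above does not say how the model decides when to reason, so the decision is represented here by a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Step:
    action: str
    thought: Optional[str]  # None when the fast path was taken


@dataclass
class AdaptiveAgent:
    # All three callables are hypothetical placeholders, not VLingNav components.
    needs_reasoning: Callable[[str], bool]          # gate: is deliberate planning needed?
    fast_policy: Callable[[str], str]               # observation -> action, no reasoning
    slow_planner: Callable[[str], Tuple[str, str]]  # observation -> (thought, action)

    def step(self, observation: str) -> Step:
        # Slow path: produce an explicit reasoning trace before acting.
        if self.needs_reasoning(observation):
            thought, action = self.slow_planner(observation)
            return Step(action=action, thought=thought)
        # Fast path: act directly from the observation.
        return Step(action=self.fast_policy(observation), thought=None)


def make_toy_agent() -> AdaptiveAgent:
    # Toy stand-ins: reason only when the observation mentions a junction.
    return AdaptiveAgent(
        needs_reasoning=lambda obs: "junction" in obs,
        fast_policy=lambda obs: "forward",
        slow_planner=lambda obs: ("Unexplored doorway on the left; try it.", "turn_left"),
    )
```

Calling `make_toy_agent().step("corridor")` takes the fast path and returns a `Step` with no thought, while an observation containing "junction" goes through the planner and carries its reasoning text alongside the action. Keeping the gate separate from both policies makes the cost trade-off explicit: reasoning is paid for only on the steps that trigger it.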

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Reinforcement Learning in Robotics