AgentVLN: Towards Agentic Vision-and-Language Navigation

Zihao Xin; Wentong Li; Yixuan Jiang; Ziyuan Huang; Bin Wang; Piji Li; Jianke Zhu; Jie Qin; Shengjun Huang

arXiv:2603.17670 · cs.RO · March 19, 2026

TL;DR

AgentVLN introduces a novel, efficient embodied navigation framework that leverages a VLM-as-Brain paradigm, cross-space representation mapping, and active exploration strategies to improve long-horizon vision-and-language navigation in unseen environments.

Contribution

The paper proposes a VLN framework that decouples high-level semantic reasoning from perception and planning, introduces a cross-space representation mapping to resolve inconsistency between representation levels, and develops a large-scale instruction-tuning dataset.

Findings

01. Outperforms prior SOTA on long-horizon VLN benchmarks.

02. Enables lightweight deployment on edge platforms.

03. Improves long-term navigation accuracy and robustness.

Abstract

Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into…
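
To make the VLM-as-Brain decoupling in the abstract more concrete, here is a minimal Python sketch of a plug-and-play skill library driven by a high-level decision maker. It is an illustration under stated assumptions, not the authors' implementation: every name (`SkillLibrary`, `move_to`, `stop`, `stub_brain`, `run_episode`) is hypothetical, the "brain" is a hard-coded rule standing in for a VLM call, and the 0.25 m step size is arbitrary. Each skill reports how many primitive steps it consumed, which loosely mirrors the semi-Markov setting, where a single decision can span a variable amount of time.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Pose = Tuple[float, float]  # (x, y) position in metres
Skill = Callable[[Pose, dict], Tuple[Pose, int]]  # returns (new pose, primitive steps used)


class SkillLibrary:
    """Registry that lets skills be added or swapped without touching the brain."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, name: str, skill: Skill) -> None:
        self._skills[name] = skill

    def names(self) -> List[str]:
        return sorted(self._skills)

    def run(self, name: str, pose: Pose, args: dict) -> Tuple[Pose, int]:
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name](pose, args)


def move_to(pose: Pose, args: dict) -> Tuple[Pose, int]:
    """Toy skill: jump to the target, charging one primitive step per 0.25 m."""
    target = args["target"]
    dist = ((target[0] - pose[0]) ** 2 + (target[1] - pose[1]) ** 2) ** 0.5
    steps = max(1, round(dist / 0.25))
    return target, steps


def stop(pose: Pose, args: dict) -> Tuple[Pose, int]:
    """Toy skill: end the episode in place, using a single primitive step."""
    return pose, 1


def stub_brain(instruction: str, pose: Pose, skill_names: List[str]) -> Tuple[str, dict]:
    """Stand-in for the VLM: a fixed rule, not a model call."""
    goal = (1.0, 0.0)
    if pose == goal:
        return "stop", {}
    return "move_to", {"target": goal}


def build_library() -> SkillLibrary:
    library = SkillLibrary()
    library.register("move_to", move_to)
    library.register("stop", stop)
    return library


def run_episode(library, brain, instruction, start: Pose, max_decisions: int = 10):
    """Alternate between a high-level decision and a variable-length skill call."""
    pose, log = start, []
    for _ in range(max_decisions):
        name, args = brain(instruction, pose, library.names())
        pose, steps = library.run(name, pose, args)
        log.append((name, steps))
        if name == "stop":
            break
    return pose, log


if __name__ == "__main__":
    final_pose, decisions = run_episode(
        build_library(), stub_brain, "go to the door", (0.0, 0.0)
    )
    print(final_pose, decisions)
```

In this sketch the decision maker only sees skill names and the current pose, so a perception or planning skill can be replaced by registering a different callable under the same name.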

Code & Models

Datasets

allenxinn/AgentVLN-Instruct
dataset · 51 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI