AgentVLN: Towards Agentic Vision-and-Language Navigation
Zihao Xin, Wentong Li, Yixuan Jiang, Ziyuan Huang, Bin Wang, Piji Li, Jianke Zhu, Jie Qin, Shengjun Huang

TL;DR
AgentVLN introduces a novel, efficient embodied navigation framework that leverages a VLM-as-Brain paradigm, cross-space representation mapping, and active exploration strategies to improve long-horizon vision-and-language navigation in unseen environments.
Contribution
The paper proposes a VLN framework that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library, introduces a cross-space representation mapping to resolve multi-level representation inconsistency, and develops a large-scale instruction-tuning dataset.
Findings
Outperforms prior SOTA on long-horizon VLN benchmarks.
Enables lightweight deployment on edge platforms.
Improves long-term navigation accuracy and robustness.
Abstract
Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into…
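The VLM-as-Brain idea described in the abstract can be illustrated with a minimal sketch, assuming a toy 2D world. Nothing here comes from the AgentVLN codebase: `SkillLibrary`, `move_to`, `stop`, `scripted_brain`, and `run_episode` are hypothetical names, and the scripted plan stands in for a real VLM. The sketch only shows the semi-Markov flavor of the formulation: the high-level decider picks a skill, and each skill consumes a variable number of primitive steps before control returns to the decider.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Pose = Tuple[float, float]  # hypothetical 2D agent position (assumption)


@dataclass
class SkillResult:
    pose: Pose   # agent pose after the skill finishes
    steps: int   # primitive steps consumed (the variable option duration)


Skill = Callable[[Pose, dict], SkillResult]


class SkillLibrary:
    """Plug-and-play registry: skills are added by name, not hard-coded."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, name: str) -> Callable[[Skill], Skill]:
        def decorator(fn: Skill) -> Skill:
            self._skills[name] = fn
            return fn
        return decorator

    def call(self, name: str, pose: Pose, args: dict) -> SkillResult:
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name](pose, args)


library = SkillLibrary()


@library.register("move_to")
def move_to(pose: Pose, args: dict) -> SkillResult:
    # Toy planner: walk straight to the target in fixed-size steps.
    target: Pose = args["target"]
    step_size = 0.25  # assumed primitive step length in metres
    dist = math.hypot(target[0] - pose[0], target[1] - pose[1])
    return SkillResult(pose=target, steps=max(1, math.ceil(dist / step_size)))


@library.register("stop")
def stop(pose: Pose, args: dict) -> SkillResult:
    return SkillResult(pose=pose, steps=0)


def scripted_brain(plan: Iterable[Tuple[str, dict]]) -> Callable[[Pose], Tuple[str, dict]]:
    """Stand-in for the VLM: replays a fixed plan, then emits 'stop'."""
    it = iter(plan)

    def decide(pose: Pose) -> Tuple[str, dict]:
        return next(it, ("stop", {}))

    return decide


def run_episode(decide, lib: SkillLibrary, start: Pose, max_decisions: int = 10):
    """High-level loop: one decision per skill call, not per primitive step."""
    pose, total_steps, trace = start, 0, []  # type: Pose, int, List[str]
    for _ in range(max_decisions):
        name, args = decide(pose)
        trace.append(name)
        if name == "stop":
            break
        result = lib.call(name, pose, args)
        pose, total_steps = result.pose, total_steps + result.steps
    return pose, total_steps, trace
```

Under these assumptions, calling `run_episode(scripted_brain(plan), library, (0.0, 0.0))` with a plan of two `move_to` entries returns the final pose, the total primitive steps taken, and the sequence of skills chosen. Swapping in a new skill only requires another `@library.register(...)` function; the decision loop itself does not change, which is the property the plug-and-play description suggests.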
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI
