SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

Jingzhi Huang; Junkai Huang; Wenxuan Song; Haoyang Yang; Hailong Huang; Haoang Li; Yi Wang

arXiv:2605.17249·cs.RO·May 19, 2026

TL;DR

SEDualVLN is a dual-system framework for vision-language navigation: System 1 is a spatially enhanced vision-language model that generates actions, and System 2 pairs a general multimodal large language model with a mapping module to plan waypoints. It reports state-of-the-art results on VLN-CE benchmarks.

Contribution

The paper presents SEDualVLN, a dual-system approach that adds global and local spatial awareness to the action-generating VLM and couples it with an MLLM waypoint planner backed by a mapping module, so that the two systems cooperate during navigation.

Findings

1. Achieves state-of-the-art results on VLN-CE benchmarks.

2. Demonstrates effectiveness of spatial enhancements in both systems.

3. Shows improved long-horizon navigation and planning capabilities.

Abstract

Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: end-to-end Vision-Language Model (VLM) policies fine-tuned on navigation trajectories to predict actions directly, and zero-shot modular pipelines that integrate a pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging…
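
The abstract is truncated here and does not say how the two systems exchange information, so the following is only a toy illustration of the general slow-planner / fast-actor pattern it describes, not the authors' method. Every name in it (`ToyMap`, `slow_planner`, `fast_policy`, `navigate`) is hypothetical, the "environment" is an obstacle-free integer grid, and simple geometric rules stand in for the VLM and MLLM.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]

# Discrete actions and their effect on an (x, y) grid position.
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


@dataclass
class ToyMap:
    """Hypothetical stand-in for a mapping module: it only logs visited cells."""
    visited: List[Point] = field(default_factory=list)

    def update(self, pos: Point) -> None:
        self.visited.append(pos)


def slow_planner(pos: Point, goal: Point, toy_map: ToyMap, horizon: int = 3) -> Point:
    """Stand-in for System 2: propose a waypoint at most `horizon` cells away per axis.

    A real planner would reason over the map and the instruction; this one
    ignores `toy_map` and simply clamps the offset toward the goal.
    """
    dx = max(-horizon, min(horizon, goal[0] - pos[0]))
    dy = max(-horizon, min(horizon, goal[1] - pos[1]))
    return (pos[0] + dx, pos[1] + dy)


def fast_policy(pos: Point, waypoint: Point) -> str:
    """Stand-in for System 1: emit one low-level action toward the waypoint."""
    if pos[0] != waypoint[0]:
        return "east" if waypoint[0] > pos[0] else "west"
    return "north" if waypoint[1] > pos[1] else "south"


def navigate(start: Point, goal: Point, max_steps: int = 50):
    """Run the loop: replan only when the current waypoint has been reached."""
    pos = start
    toy_map = ToyMap(visited=[start])
    waypoint = slow_planner(pos, goal, toy_map)
    planner_calls = 1
    actions: List[str] = []
    for _ in range(max_steps):
        if pos == goal:
            break
        if pos == waypoint:
            # Slow system is invoked sparingly; the fast system runs every step.
            waypoint = slow_planner(pos, goal, toy_map)
            planner_calls += 1
        action = fast_policy(pos, waypoint)
        dx, dy = MOVES[action]
        pos = (pos[0] + dx, pos[1] + dy)
        toy_map.update(pos)
        actions.append(action)
    return pos, actions, planner_calls


if __name__ == "__main__":
    final_pos, actions, calls = navigate((0, 0), (5, 2))
    print(final_pos, len(actions), calls)
```

The point of the sketch is the call pattern: the expensive planner runs once per waypoint while the cheap policy runs at every step, which is one common way a dual-system design can limit reasoning time.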
