SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation
Jingzhi Huang, Junkai Huang, Wenxuan Song, Haoyang Yang, Hailong Huang, Haoang Li, Yi Wang

TL;DR
SEDualVLN is a dual-system framework for vision-language navigation: a spatially-aware vision-language model generates actions (System 1), while a general multimodal large language model paired with a mapping module plans waypoints (System 2). It reports state-of-the-art results on VLN-CE benchmarks.
Contribution
The paper presents SEDualVLN, a dual-system approach that adds spatial enhancements to both the action-generating model and the planning model, and has the two cooperate to improve navigation performance.
Findings
Achieves state-of-the-art results on VLN-CE benchmarks.
Demonstrates effectiveness of spatial enhancements in both systems.
Shows improved long-horizon navigation and planning capabilities.
Abstract
Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: the end-to-end Vision-Language Model (VLM) policy, fine-tuned on navigation trajectories to predict actions directly, and the zero-shot modular pipeline, which integrates a pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging…
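
The division of labour described in the abstract (a slow planner that proposes waypoints, a fast policy that turns the current waypoint into low-level actions) can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify interfaces, so `plan_waypoint`, `choose_action`, `step_env`, `navigate`, the replanning interval, and the action set are all hypothetical, and both learned models are replaced by simple geometric stand-ins.

```python
import math
from dataclasses import dataclass

TURN = math.radians(15)   # assumed discrete turn angle
FORWARD = 0.25            # assumed forward step in metres


@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians, 0 points along +x


def plan_waypoint(pose, goal, reach=2.0):
    """Hypothetical System 2 stand-in (the paper uses an MLLM plus a map):
    return a waypoint at most `reach` metres along the line to the goal."""
    dx, dy = goal[0] - pose.x, goal[1] - pose.y
    dist = math.hypot(dx, dy)
    if dist <= reach:
        return goal
    return (pose.x + reach * dx / dist, pose.y + reach * dy / dist)


def choose_action(pose, waypoint):
    """Hypothetical System 1 stand-in (the paper uses a VLM policy):
    turn toward the waypoint, otherwise move forward."""
    bearing = math.atan2(waypoint[1] - pose.y, waypoint[0] - pose.x)
    # Wrap the heading error into [-pi, pi).
    err = (bearing - pose.heading + math.pi) % (2 * math.pi) - math.pi
    if err > TURN / 2:
        return "TURN_LEFT"
    if err < -TURN / 2:
        return "TURN_RIGHT"
    return "FORWARD"


def step_env(pose, action):
    """Toy obstacle-free kinematics for the three motion actions."""
    if action == "TURN_LEFT":
        return Pose(pose.x, pose.y, pose.heading + TURN)
    if action == "TURN_RIGHT":
        return Pose(pose.x, pose.y, pose.heading - TURN)
    return Pose(pose.x + FORWARD * math.cos(pose.heading),
                pose.y + FORWARD * math.sin(pose.heading),
                pose.heading)


def navigate(start, goal, replan_every=8, max_steps=500, stop_radius=0.3):
    """Run the fast policy every step and the slow planner only occasionally."""
    pose, waypoint, actions = start, None, []
    for t in range(max_steps):
        if math.hypot(goal[0] - pose.x, goal[1] - pose.y) < stop_radius:
            actions.append("STOP")
            break
        reached = waypoint is not None and math.hypot(
            waypoint[0] - pose.x, waypoint[1] - pose.y) < stop_radius
        if waypoint is None or reached or t % replan_every == 0:
            waypoint = plan_waypoint(pose, goal)   # slow system
        action = choose_action(pose, waypoint)     # fast system
        actions.append(action)
        pose = step_env(pose, action)
    return pose, actions
```

Calling `navigate(Pose(0.0, 0.0, 0.0), (3.0, 4.0))` returns the final pose and the list of discrete actions taken. The point of the structure is that the expensive planner is invoked only every few steps or when a waypoint is reached, which is one plausible way to limit the reasoning-time cost the abstract attributes to zero-shot pipelines.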
