TL;DR
SkillNav introduces a modular, skill-based framework for Vision-and-Language Navigation that improves interpretability and generalization to complex, unseen environments through structured reasoning and a VLM-based routing mechanism.
Contribution
The paper presents SkillNav, a new modular approach that decomposes navigation into interpretable skills and employs a dynamic VLM-based router for improved generalization.
Findings
Achieves state-of-the-art generalization on the GSA-R2R benchmark.
Constructs a synthetic dataset pipeline for skill-specific instruction-trajectory pairs.
Demonstrates competitive results on standard VLN benchmarks.
Abstract
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific…
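The skill decomposition described in the abstract, together with the router mentioned under Contribution, can be pictured as a dispatch step: a router picks one skill per decision, and the matching specialized agent chooses the action. The following is only a minimal sketch under stated assumptions, not the paper's implementation. The three skill names come from the abstract; the `Observation` type, the agent functions, and the `keyword_router` are hypothetical. In particular, the keyword matching is a plain stand-in for the VLM-based router, whose actual inputs and prompts are not described in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Hypothetical per-step input: instruction text plus navigable candidates."""
    instruction: str
    candidates: List[str]  # hypothetical viewpoint ids


# Hypothetical skill agents. Real agents would be trained navigation models;
# these placeholders only make the dispatch pattern runnable.
def vertical_movement_agent(obs: Observation) -> str:
    return obs.candidates[0]


def area_region_agent(obs: Observation) -> str:
    return obs.candidates[-1]


def stop_pause_agent(obs: Observation) -> str:
    return "STOP"


# Skill names are taken from the abstract; the full skill set is not listed there.
AGENTS: Dict[str, Callable[[Observation], str]] = {
    "Vertical Movement": vertical_movement_agent,
    "Area and Region Identification": area_region_agent,
    "Stop and Pause": stop_pause_agent,
}


def keyword_router(obs: Observation) -> str:
    """Assumed stand-in for the VLM-based router: simple keyword matching."""
    text = obs.instruction.lower()
    if any(word in text for word in ("stairs", "upstairs", "downstairs")):
        return "Vertical Movement"
    if any(word in text for word in ("stop", "wait", "pause")):
        return "Stop and Pause"
    return "Area and Region Identification"


def step(
    obs: Observation,
    router: Callable[[Observation], str],
    agents: Dict[str, Callable[[Observation], str]],
) -> Tuple[str, str]:
    """Select a skill with the router, then let that skill's agent act."""
    skill = router(obs)
    return skill, agents[skill](obs)


if __name__ == "__main__":
    obs = Observation("Go up the stairs", ["vp_a", "vp_b"])
    print(step(obs, keyword_router, AGENTS))
```

Because the router is passed in as a callable, the keyword stand-in could be swapped for a model-backed router without changing the agents, which is the kind of separation a modular design allows.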