Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma; Yue Zhang; Zehao Wang; Parisa Kordjamshidi

arXiv:2508.07642·cs.AI·May 14, 2026

TL;DR

SkillNav introduces a modular, skill-based framework for Vision-and-Language Navigation, enhancing interpretability and generalization in complex, unseen environments through structured reasoning and a novel routing mechanism.

Contribution

The paper presents SkillNav, a new modular approach that decomposes navigation into interpretable skills and employs a dynamic VLM-based router for improved generalization.

Findings

01 Achieves state-of-the-art generalization on the GSA-R2R benchmark.

02 Constructs a synthetic dataset pipeline for skill-specific instruction-trajectory pairs.

03 Demonstrates competitive results on standard VLN benchmarks.

Abstract

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific…
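The skill decomposition described in the abstract can be pictured with a small toy sketch. It is not the authors' implementation: the three skill names come from the abstract, but the `Observation` type, the per-skill agent functions, and `keyword_router` are hypothetical. In particular, the keyword matcher below is only a stand-in for the paper's VLM-based router, and the agents are trivial rules rather than Transformer-based policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Hypothetical per-step input: the current sub-instruction and
    the names of the navigable candidate viewpoints."""
    sub_instruction: str
    candidates: List[str]


# A skill agent maps an observation to an action (a candidate name or "STOP").
SkillAgent = Callable[[Observation], str]


def vertical_movement_agent(obs: Observation) -> str:
    # Toy rule: prefer a candidate that mentions stairs.
    for cand in obs.candidates:
        if "stairs" in cand.lower():
            return cand
    return obs.candidates[0]


def area_region_agent(obs: Observation) -> str:
    # Toy rule: prefer a candidate whose name appears in the instruction.
    text = obs.sub_instruction.lower()
    for cand in obs.candidates:
        if cand.lower() in text:
            return cand
    return obs.candidates[0]


def stop_pause_agent(obs: Observation) -> str:
    # Toy rule: always end the episode.
    return "STOP"


# Skill names are taken from the abstract; the mapping itself is an assumption.
AGENTS: Dict[str, SkillAgent] = {
    "Vertical Movement": vertical_movement_agent,
    "Area and Region Identification": area_region_agent,
    "Stop and Pause": stop_pause_agent,
}


def keyword_router(obs: Observation) -> str:
    """Stand-in for the VLM-based router: choose a skill by keyword match."""
    text = obs.sub_instruction.lower()
    if any(word in text for word in ("stop", "wait", "pause")):
        return "Stop and Pause"
    if any(word in text for word in ("stairs", "upstairs", "downstairs")):
        return "Vertical Movement"
    return "Area and Region Identification"


def navigate_step(
    obs: Observation,
    router: Callable[[Observation], str] = keyword_router,
    agents: Dict[str, SkillAgent] = AGENTS,
) -> Tuple[str, str]:
    """Route the observation to one skill agent and return (skill, action)."""
    skill = router(obs)
    return skill, agents[skill](obs)


if __name__ == "__main__":
    step = Observation("Go up the stairs", ["hallway", "stairs_up"])
    print(navigate_step(step))
```

Because the router is passed in as a plain callable, a learned or VLM-backed router could be swapped in without touching the skill agents, which is the kind of modularity the summary attributes to SkillNav.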
