TL;DR
TagaVLM introduces a topology-aware framework for vision-language navigation, explicitly integrating spatial structures into large models to improve global action reasoning and navigation performance.
Contribution
It proposes novel topological modules, STAR-Att and navigation prompts, to enhance spatial reasoning in VLMs for embodied navigation tasks.
Findings
Achieves state-of-the-art results on the R2R benchmark with a 51.09% success rate (SR).
Outperforms prior methods by 3.39% in SR and 9.08 in SPL (success weighted by path length) in unseen environments.
Demonstrates that targeted enhancements on smaller models can surpass brute-force scaling.
Abstract
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling…
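As a rough illustration of what injecting topological edge information into self-attention could look like, the sketch below adds a distance-based bias to the attention logits before the softmax. This is not the paper's STAR-Att formulation, which the truncated abstract does not spell out. The additive form, the function name `topology_biased_attention`, and the `bias_scale` parameter are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def topology_biased_attention(q, k, v, edge_dist, bias_scale=1.0):
    """Single-head attention over n graph nodes with an additive edge bias.

    q, k, v:    arrays of shape (n, d), one row per node in the topological map.
    edge_dist:  array of shape (n, n), pairwise distance between nodes
                (hypothetical input; how the paper encodes edges is not
                stated in the abstract).
    bias_scale: hypothetical scalar controlling how strongly distance
                suppresses attention.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)       # standard scaled dot-product scores
    bias = -bias_scale * edge_dist      # assumed form: farther nodes get lower scores
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights


# Three nodes on a line: 0 -- 1 -- 2, with node 2 far from node 0.
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 1.0],
                 [5.0, 1.0, 0.0]])
feats = np.zeros((3, 4))                # identical content, so only topology matters
out, w = topology_biased_attention(feats, feats, feats, dist)
```

With identical node features, the attention weights depend only on the distance bias, so node 0 attends most to itself, less to node 1, and least to node 2. In a real model the bias would be combined with learned content scores rather than replacing them.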
