TL;DR
This paper introduces BTK, a multimodal knowledge base framework that enhances vision-and-language navigation by integrating textual and image knowledge bases, significantly improving navigation accuracy.
Contribution
BTK is the first framework to synergistically combine environment-specific textual and image knowledge bases for VLN, improving semantic grounding and cross-modal alignment.
Findings
Success Rate (SR) increased by 5% on R2R and 2.07% on REVERIE.
Success weighted by Path Length (SPL) increased by 4% on R2R and 3.69% on REVERIE.
Significant performance improvements over existing baselines.
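For readers unfamiliar with the two metrics, the sketch below shows how SR and SPL are commonly computed in VLN evaluation. It is a minimal illustration, not the paper's evaluation code: the function names and the toy episode values are assumptions, and the summary above does not say whether the reported gains are absolute or relative.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent stopped at the goal."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, using the standard definition:
    mean over episodes of S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length and p_i is the length of the path actually taken.
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # Guard against a degenerate zero-length episode (assumed to score 1.0).
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)


# Toy data (hypothetical, not from the paper): four episodes.
successes = [True, False, True, True]
shortest_lengths = [10.0, 8.0, 5.0, 4.0]
path_lengths = [10.0, 12.0, 10.0, 4.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
print(f"SR = {sr_value:.3f}, SPL = {spl_value:.3f}")
```

SPL can never exceed SR, because each successful episode is scaled by a path-efficiency factor of at most 1. This means an SPL gain indicates shorter paths as well as more frequent successes.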
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge…
