Beyond Textual Knowledge: Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Dongsheng Yang; Yinfeng Yu; Liejun Wang

arXiv:2603.26859·cs.CV·March 31, 2026

TL;DR

This paper introduces BTK (Beyond Textual Knowledge), a vision-and-language navigation framework that integrates an environment-specific textual knowledge base with generated image knowledge bases, and reports gains in SR and SPL on the R2R and REVERIE datasets.

Contribution

BTK is the first framework to synergistically combine environment-specific textual and image knowledge bases for VLN, improving semantic grounding and cross-modal alignment.

Findings

01

Success Rate (SR) increased by 5% on R2R and 2.07% on REVERIE.

02

Success weighted by Path Length (SPL) increased by 4% on R2R and 3.69% on REVERIE.

03

Significant performance improvements over existing baselines.
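
For readers unfamiliar with the two metrics, the following is a minimal sketch of how SR and SPL are conventionally computed in VLN evaluation. It is not taken from the paper's code. The `Episode` fields and the toy episodes are made up for illustration, and the listing above does not state whether the reported gains are absolute percentage points.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One evaluation episode (field names are illustrative assumptions)."""
    success: bool          # agent stopped within the success radius of the goal
    shortest_path: float   # shortest start-to-goal path length, assumed > 0
    agent_path: float      # length of the path the agent actually travelled


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length: each success is scaled by
    shortest_path / max(agent_path, shortest_path); failures contribute 0."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy episodes, invented purely to exercise the two functions.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # 2x detour
    Episode(success=False, shortest_path=8.0, agent_path=5.0),    # failed
    Episode(success=True, shortest_path=6.0, agent_path=8.0),     # small detour
]

print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5625
```

SPL can never exceed SR, because each successful episode contributes at most 1 to the sum. This is why a method can raise SR while gaining less, or even losing, on SPL if its extra successes come from longer paths.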

Abstract

Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge…
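
The knowledge-base construction steps named in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the three callables are hypothetical stand-ins for the roles the abstract assigns to Qwen3-4B (phrase extraction), Flux-Schnell (image generation), and BLIP-2 (captioning), and the dictionary layouts and phrase-level deduplication are assumptions. How the knowledge bases are fused into the navigation model is not covered, since the abstract is truncated at that point.

```python
from typing import Callable, Dict, Iterable, List


def build_image_kb(
    instructions: Iterable[str],
    extract_phrases: Callable[[str], List[str]],   # stand-in for the LLM step
    generate_image: Callable[[str], object],       # stand-in for the image generator
) -> Dict[str, object]:
    """Map each goal-related phrase to one generated image.

    Keying by phrase and generating once per unique phrase are assumptions.
    """
    kb: Dict[str, object] = {}
    for instruction in instructions:
        for phrase in extract_phrases(instruction):
            if phrase not in kb:
                kb[phrase] = generate_image(phrase)
    return kb


def build_text_kb(
    panoramas: Dict[str, List[object]],
    caption_view: Callable[[object], str],          # stand-in for the captioner
) -> Dict[str, List[str]]:
    """Map each viewpoint id to captions of its panoramic views."""
    return {
        viewpoint: [caption_view(view) for view in views]
        for viewpoint, views in panoramas.items()
    }


# Toy stand-ins so the sketch runs without any model downloads.
def toy_extract(instruction: str) -> List[str]:
    vocabulary = ("kitchen", "sofa", "stairs")
    return [word for word in vocabulary if word in instruction.lower()]


def toy_generate(phrase: str) -> str:
    return f"<image of {phrase}>"


def toy_caption(view: object) -> str:
    return f"a photo of {view}"


image_kb = build_image_kb(
    [
        "Walk past the sofa and stop in the kitchen.",
        "Go up the stairs, then turn left at the sofa.",
    ],
    toy_extract,
    toy_generate,
)
text_kb = build_text_kb({"viewpoint_0": ["a hallway", "a door"]}, toy_caption)
```

Passing the models in as callables keeps the sketch independent of any particular model-loading API. In a real pipeline each stand-in would wrap the corresponding model call.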

Code & Models

Repositories

yds3/IPM-BTK (GitHub)
