Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos

Haoxuan Xu; Tianfu Li; Wenbo Chen; Yi Liu; Xingxing Zuo; Yaoxian Song; Haoang Li

arXiv:2602.23937·cs.RO·March 2, 2026

Open Access

TL;DR

This paper introduces a multimodal event knowledge graph and a hierarchical retrieval method to improve vision-language navigation, enabling agents to better handle long-horizon reasoning and ambiguous instructions in unseen indoor environments.

Contribution

It presents the first large-scale multimodal spatiotemporal knowledge graph and a novel retrieval mechanism that enhances VLN agents' reasoning capabilities.

Findings

01

Outperforms state-of-the-art methods on REVERIE, R2R, and R2R-CE benchmarks.

02

Effectively retrieves causal event sequences for improved navigation.

03

Demonstrates the benefit of multimodal episodic memory in complex reasoning tasks.

Abstract

Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy that combines automated process-knowledge mining with feature fusion to address coarse-grained instructions and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVA, GPT-4), we convert unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory.…
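
As a rough illustration of what a graph of semantic-action-effect events could look like in code, the sketch below stores events as nodes and "what happened next" links as edges, then walks those links to recover a short event sequence. It is not the authors' implementation and not the YE-KG schema: the class names, field names, the `follow` helper, and the three toy events are all hypothetical, and the paper's hierarchical retrieval is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One event node. Field names are illustrative, not taken from the paper."""
    event_id: str
    semantic: str  # scene or object context
    action: str    # what is done in that context
    effect: str    # what the action leads to


@dataclass
class EventGraph:
    """Toy directed graph: nodes are events, edges point to successor events."""
    nodes: dict = field(default_factory=dict)  # event_id -> Event
    edges: dict = field(default_factory=dict)  # event_id -> list of successor ids

    def add_event(self, event):
        self.nodes[event.event_id] = event
        self.edges.setdefault(event.event_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def follow(self, start, max_steps=3):
        """Walk first-successor edges from start; returns at most max_steps + 1 events."""
        chain = [self.nodes[start]]
        current = start
        for _ in range(max_steps):
            successors = self.edges.get(current, [])
            if not successors:
                break
            current = successors[0]
            chain.append(self.nodes[current])
        return chain


# Hypothetical toy events, invented for this sketch.
graph = EventGraph()
graph.add_event(Event("e1", "hallway", "walk forward", "reach the kitchen door"))
graph.add_event(Event("e2", "kitchen door", "open door", "enter the kitchen"))
graph.add_event(Event("e3", "kitchen", "turn left", "face the sink"))
graph.add_edge("e1", "e2")
graph.add_edge("e2", "e3")

chain_ids = [e.event_id for e in graph.follow("e1")]
```

A real system would rank candidate successors against the current observation and instruction rather than always taking the first edge; the fixed `max_steps` bound here only keeps the walk finite if the graph contains cycles.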

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)