Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

Mingfei Han; Haihong Hao; Liang Ma; Kamila Zhumakhanova; Ekaterina Radionova; Jingyi Zhang; Xiaojun Chang; Xiaodan Liang; Ivan Laptev

arXiv:2603.09259 · cs.CV · March 11, 2026

Open Access

TL;DR

This paper introduces a scalable web video-based framework for vision-and-language navigation that leverages implicit geometry representations from RGB frames, significantly improving performance and robustness in real-world indoor environments.

Contribution

The paper presents a large-scale web video dataset with rich annotations and incorporates implicit geometry representations to improve spatial understanding without relying on fragile 3D reconstruction.

Findings

1. Achieves new state-of-the-art results on multiple VLN benchmarks.
2. Enables robust zero-shot navigation in diverse indoor environments.
3. Utilizes large-scale web videos effectively for embodied navigation.

Abstract

Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning