RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Mingfei Han; Liang Ma; Kamila Zhumakhanova; Ekaterina Radionova; Jingyi Zhang; Xiaojun Chang; Xiaodan Liang; Ivan Laptev

arXiv:2412.08591·cs.CV·March 20, 2025

TL;DR

RoomTour3D introduces a large-scale, diverse video-instruction dataset from web-based room tours, enhancing vision-and-language navigation models and enabling zero-shot navigation in real-world indoor environments.

Contribution

The paper presents RoomTour3D, a novel dataset derived from online videos with 3D reconstructions, improving VLN training and zero-shot navigation capabilities.

Findings

01. Significant performance improvements on multiple VLN benchmarks.

02. Enables development of zero-shot VLN agents.

03. Provides diverse, real-world indoor navigation data.

Abstract

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K…

Code & Models

Models

🤗 roomtour3d/roomtour3d-navillm-models (model)

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization