Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Zihan Wang; Seungjun Lee; Gim Hee Lee

arXiv:2505.11383·cs.CV·May 19, 2025

TL;DR

Dynam3D introduces a dynamic layered 3D representation model that enhances vision-and-language navigation by improving spatial understanding, exploration, and adaptability in changing environments, achieving state-of-the-art results.

Contribution

The paper presents Dynam3D, a novel hierarchical 3D representation that dynamically updates and encodes spatial information for improved VLN performance.

Findings

01 Sets new state-of-the-art on VLN benchmarks

02 Effective in real-world robot navigation scenarios

03 Enhances exploration and long-term memory capabilities

Abstract

Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input…

Code & Models

Repositories

mrzihan/dynam3d (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation

Methods: Contrastive Language-Image Pre-training