Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation
Zihan Wang, Seungjun Lee, Gim Hee Lee

TL;DR
Dynam3D introduces a dynamic layered 3D representation model that enhances vision-and-language navigation by improving spatial understanding, exploration, and adaptability in changing environments, achieving state-of-the-art results.
Contribution
The paper presents Dynam3D, a novel hierarchical 3D representation that dynamically updates and encodes spatial information for improved VLN performance.
Findings
Sets new state-of-the-art on VLN benchmarks
Effective in real-world robot navigation scenarios
Enhances exploration and long-term memory capabilities
Abstract
Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation
Methods: Contrastive Language-Image Pre-training
