D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation
Zihan Wang, Seungjun Lee, Guangzhao Dai, Gim Hee Lee

TL;DR
D3D-VLP is a unified 3D vision-language-planning model that pairs a Dynamic 3D Chain-of-Thought reasoning pipeline with a learning strategy for partially annotated data, so that a single embodied agent can handle planning, grounding, navigation, and question answering. It is trained on both real scans and synthetic scenes.
Contribution
The paper presents a unified model with a dynamic reasoning pipeline (Dynamic 3D CoT) and a hybrid supervision strategy (SLFS), aimed at improving both interpretability and performance on embodied 3D vision-language tasks.
Findings
Achieved state-of-the-art results on multiple navigation benchmarks.
Constructed a large-scale hybrid dataset of 10M samples drawn from 5K real scans and 20K synthetic scenes.
Validated effectiveness through real-world mobile manipulation experiments.
Abstract
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning…
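The abstract describes SLFS only at a high level, so the following is a minimal sketch of what a masked autoregressive loss could look like, not the authors' implementation. It assumes each training sequence carries a per-token mask that is 1 where a CoT component (plan, grounding, navigation step, answer) has an annotation and 0 where it does not. The function name `masked_autoregressive_loss`, the mask layout, and the toy sample are all hypothetical.

```python
import numpy as np

def masked_autoregressive_loss(logits, targets, supervised_mask):
    """Mean next-token cross-entropy over supervised positions only.

    logits: (T, V) scores over the vocabulary at each position.
    targets: (T,) integer token ids (any valid id at unsupervised positions).
    supervised_mask: (T,) 1.0 where an annotation exists, 0.0 otherwise.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    mask = np.asarray(supervised_mask, dtype=np.float64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token.
    nll = -log_probs[np.arange(targets.shape[0]), targets]
    # Unannotated positions contribute nothing; guard against an all-zero mask.
    return float((nll * mask).sum() / max(mask.sum(), 1.0))

# Hypothetical sample: 6 tokens, where tokens 2-3 (say, a grounding
# segment) have no annotation and are therefore masked out.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
targets = np.array([1, 4, 0, 0, 7, 2])
mask = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
loss = masked_autoregressive_loss(logits, targets, mask)
```

Under this assumption, a sample that lacks one kind of annotation still trains the remaining components, and the masked segment receives no direct gradient from the loss. Any gradient it does receive comes indirectly through the later supervised tokens that condition on it, which is one plausible reading of the implicit cross-component supervision the abstract mentions.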
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
