Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

Zhijie Yan; Shufei Li; Zuoxu Wang; Lixiu Wu; Han Wang; Jun Zhu; Lijiang Chen; Jihong Liu

arXiv:2410.11989·cs.RO·March 20, 2025

TL;DR

DovSG introduces a dynamic open-vocabulary 3D scene graph framework that enables robots to adapt to changing environments for long-term language-guided manipulation tasks, using vision-language models and efficient scene updates.

Contribution

The paper presents a novel framework combining dynamic scene graphs and language-guided planning for long-term robot manipulation in changing environments.

Findings

01 Effective scene graph updates during interactions

02 Superior performance in real-world long-term tasks

03 Robust adaptation to environment changes

Abstract

Enabling mobile robots to perform long-term tasks in dynamic real-world environments is a formidable challenge, especially when the environment changes frequently due to human-robot interactions or the robot's own actions. Traditional methods typically assume static scenes, which limits their applicability in the continuously changing real world. To overcome these limitations, we present DovSG, a novel mobile manipulation framework that leverages dynamic open-vocabulary 3D scene graphs and a language-guided task planning module for long-term task execution. DovSG takes RGB-D sequences as input and utilizes vision-language models (VLMs) for object detection to obtain high-level object semantic features. Based on the segmented objects, a structured 3D scene graph is generated for low-level spatial relationships. Furthermore, an efficient mechanism for locally updating the scene graph,…
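To make the idea of a locally updated scene graph concrete, here is a minimal Python sketch. It is not DovSG's implementation: the class names (`ObjectNode`, `SceneGraph`), the axis-aligned-box "on" heuristic, the 3 cm tolerance, and the spherical update region are all assumptions chosen for illustration. It shows only the general pattern the abstract describes, where objects carry open-vocabulary labels, edges encode spatial relationships, and a change in one region triggers recomputation for the affected nodes rather than a rebuild of the whole graph.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str      # open-vocabulary label (assumed to come from a VLM detector)
    center: tuple   # (x, y, z) centroid in metres, z pointing up
    size: tuple     # (dx, dy, dz) extents of an axis-aligned bounding box


def is_on(a, b, tol=0.03):
    """Hypothetical heuristic: a rests on b if their x/y footprints overlap
    and the bottom face of a is within `tol` metres of the top face of b."""
    for axis in (0, 1):
        if abs(a.center[axis] - b.center[axis]) > (a.size[axis] + b.size[axis]) / 2:
            return False
    gap = (a.center[2] - a.size[2] / 2) - (b.center[2] + b.size[2] / 2)
    return abs(gap) <= tol


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # id -> ObjectNode
    edges: set = field(default_factory=set)     # (child_id, "on", parent_id)
    _next_id: int = 0

    def add(self, node):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = node
        return node_id

    def relink(self, ids):
        """Recompute edges only for the given node ids."""
        ids = set(ids)
        # Drop every edge that touches an affected node.
        self.edges = {e for e in self.edges if e[0] not in ids and e[2] not in ids}
        for i in [i for i in ids if i in self.nodes]:
            for j, other in self.nodes.items():
                if i == j:
                    continue
                if is_on(self.nodes[i], other):
                    self.edges.add((i, "on", j))
                if is_on(other, self.nodes[i]):
                    self.edges.add((j, "on", i))

    def local_update(self, center, radius, observed):
        """Replace all nodes whose centroid lies within `radius` of `center`
        with the newly observed nodes, then relink only the affected ids."""
        stale = [i for i, n in self.nodes.items() if math.dist(n.center, center) <= radius]
        for i in stale:
            del self.nodes[i]
        new_ids = [self.add(n) for n in observed]
        self.relink(stale + new_ids)
        return new_ids


# Initial scene: a cup on a table, plus a shelf elsewhere in the room.
g = SceneGraph()
table_id = g.add(ObjectNode("table", (0.0, 0.0, 0.4), (1.0, 1.0, 0.8)))
cup_id = g.add(ObjectNode("cup", (0.2, 0.0, 0.85), (0.08, 0.08, 0.1)))
shelf_id = g.add(ObjectNode("shelf", (3.0, 0.0, 0.5), (1.0, 0.4, 1.0)))
g.relink(list(g.nodes))
edges_before = set(g.edges)

# The cup is moved to the shelf; only the region around its old position is refreshed.
new_ids = g.local_update(
    center=(0.2, 0.0, 0.85),
    radius=0.3,
    observed=[ObjectNode("cup", (3.0, 0.0, 1.05), (0.08, 0.08, 0.1))],
)
print(edges_before, g.edges)
```

In this toy scene the table and shelf nodes are never rebuilt; only the edges touching the removed and newly added cup nodes are recomputed. A real system would derive nodes from segmented RGB-D point clouds and would need data association between old and new observations, which this sketch deliberately omits.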

Taxonomy

Topics: Robot Manipulation and Learning · Natural Language Processing Techniques · Multimodal Machine Learning Applications