Articulated 3D Scene Graphs for Open-World Mobile Manipulation
Martin Büchner, Adrian Röfer, Tim Engelbracht, Tim Welschehold, Zuria Bauer, Hermann Blum, Marc Pollefeys, Abhinav Valada

TL;DR
This paper introduces MoMa-SG, a framework for creating semantic-kinematic 3D scene graphs from RGB-D data, enabling robots to understand and manipulate articulated objects in complex environments.
Contribution
The paper presents a novel unified twist estimation method for modeling object articulation and introduces the Arti4D-Semantic dataset for articulated scene understanding.
Findings
MoMa-SG accurately infers object kinematics from RGB-D sequences.
The approach enables robust manipulation of articulated objects in real-world settings.
Extensive evaluation shows high performance on multiple datasets.
Abstract
Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with…
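To make the articulation-estimation step more concrete, the following is a minimal sketch of how joint parameters could be recovered from 3D point tracks of a moving part. It is not the paper's unified twist formulation, which the abstract describes as a single optimization pass. Instead it uses a classical two-step approach: a Kabsch rigid alignment followed by a screw-axis readout with a hard rotation threshold. The function names (`kabsch`, `estimate_joint`) and the 5 degree threshold `angle_tol` are assumptions made for illustration only.

```python
import numpy as np


def kabsch(P, Q):
    """Least-squares rigid transform (R, t) such that Q ~= P @ R.T + t.

    P and Q are (N, 3) arrays holding the same tracked 3D points
    before and after the interaction.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t


def estimate_joint(P, Q, angle_tol=np.deg2rad(5.0)):
    """Classify the motion P -> Q as prismatic or revolute (illustrative only).

    angle_tol is an assumed threshold: rotations smaller than this are
    treated as pure translation.
    """
    R, t = kabsch(P, Q)
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)

    if theta < angle_tol:
        # Prismatic: the axis is the translation direction.
        dist = np.linalg.norm(t)
        axis = t / dist if dist > 0 else np.zeros(3)
        return {"type": "prismatic", "axis": axis, "displacement": dist}

    # Revolute: rotation axis from the skew-symmetric part of R.
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    # A point c on the axis satisfies (I - R) c = t for a pure rotation.
    # I - R is rank 2, so take the minimum-norm least-squares solution,
    # which is the point on the axis closest to the origin.
    c, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=1e-8)
    return {"type": "revolute", "axis": axis, "angle": theta, "point": c}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(50, 3))
    # Simulate a drawer sliding 0.4 m along x.
    print(estimate_joint(pts, pts + np.array([0.4, 0.0, 0.0])))
```

The hard threshold is the main weakness of this sketch: a small door swing and a short drawer pull can look alike under noise, which is presumably part of the motivation for estimating both joint types jointly. The axis readout is also ill-conditioned for rotations close to 180 degrees.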
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
