Clio: Real-time Task-Driven Open-Set 3D Scene Graphs
Dominic Maggio, Yun Chang, Nathan Hughes, Matthew Trang, Dan Griffith, Carlyn Dougherty, Eric Cristofalo, Lukas Schmid, Luca Carlone

TL;DR
Clio is a real-time system that constructs task-specific 3D scene graphs by clustering environment primitives based on natural language tasks, improving robotic perception and task execution accuracy.
Contribution
This paper introduces a task-driven 3D scene understanding framework based on the Information Bottleneck principle, together with a real-time pipeline for building hierarchical scene graphs on robots.
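To make the idea of task-driven selection concrete, here is a minimal sketch of scoring scene primitives against natural-language tasks in a shared embedding space. It is a deliberate simplification, not Clio's Information Bottleneck clustering: it only filters primitives by cosine similarity and assigns each to its best-matching task. The function names (`task_relevance`, `select_relevant`), the `threshold` parameter, and the toy 3-D vectors are all hypothetical; in practice the vectors would be open-set features such as CLIP embeddings.

```python
import numpy as np


def task_relevance(primitive_emb, task_emb):
    """Cosine similarity between every primitive and every task (N x M)."""
    p = primitive_emb / np.linalg.norm(primitive_emb, axis=1, keepdims=True)
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    return p @ t.T


def select_relevant(primitive_emb, task_emb, threshold=0.5):
    """Keep primitives whose best task similarity reaches the threshold.

    Returns the indices of the kept primitives and, for each kept
    primitive, the index of its best-matching task.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    sim = task_relevance(primitive_emb, task_emb)
    best_task = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    keep = np.flatnonzero(best_score >= threshold)
    return keep, best_task[keep]


# Toy stand-ins for embeddings: four primitives, two tasks.
primitives = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # unrelated to either task
])
tasks = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

keep, assigned = select_relevant(primitives, tasks, threshold=0.5)
print(keep.tolist(), assigned.tolist())
```

In this toy run the last primitive is dropped because it is orthogonal to both tasks, while the first two primitives are grouped under the first task. A fixed threshold is exactly the kind of hand-tuned granularity choice the paper argues should instead be derived from the task.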
Findings
Enables real-time, compact open-set 3D scene graphs
Improves task execution accuracy by focusing on relevant concepts
Demonstrates effective clustering of 3D primitives into task-relevant objects
Abstract
Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us with a fundamental question: what is the right granularity for the objects (and, more generally, for the semantic concepts) the robot has to include in its map representation? While related work implicitly chooses a level of granularity by tuning thresholds for object detection, we argue that such a choice is intrinsically task-dependent. The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language…
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
