Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding
William Chen, Siyi Hu, Rajat Talak, Luca Carlone

TL;DR
This paper presents a zero-shot method that uses large language models to improve semantic 3D scene understanding in robotics, labeling rooms from the objects they contain without task-specific training.
Contribution
It introduces a novel zero-shot approach that uses language models for scene labeling, generalizing to unseen objects and room types without prior task-specific data.
Findings
Zero-shot labeling of rooms from the objects they contain
No need for task-specific pre-training
Generalizes to unseen labels
Abstract
Semantic 3D scene understanding is a problem of critical importance in robotics. While significant advances have been made in simultaneous localization and mapping algorithms, robots are still far from having the common sense knowledge about household objects and their locations of an average human. We introduce a novel method for leveraging common sense embedded within large language models for labelling rooms given the objects contained within. This algorithm has the added benefits of (i) requiring no task-specific pre-training (operating entirely in the zero-shot regime) and (ii) generalizing to arbitrary room and object labels, including previously-unseen ones -- both of which are highly desirable traits in robotic scene understanding algorithms. The proposed algorithm operates on 3D scene graphs produced by modern spatial perception systems, and we hope it will pave the way to more…
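The idea in the abstract, labelling a room from the objects it contains by querying a language model, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The prompt template, the function names (`build_query`, `label_room`), and the `toy_score` stand-in are all assumptions made for illustration. In practice `score_text` would be replaced by something like a language model's log-likelihood for the sentence; a small hand-written association table is used here only so the example runs without any model.

```python
from typing import Callable, Iterable, Sequence


def build_query(objects: Sequence[str], room: str) -> str:
    """Hypothetical prompt template: one sentence per candidate room label."""
    listed = ", ".join(objects)
    return f"A room containing {listed} is called a {room}."


def label_room(
    objects: Sequence[str],
    candidate_rooms: Iterable[str],
    score_text: Callable[[str], float],
) -> str:
    """Return the candidate room whose query sentence scores highest."""
    return max(
        candidate_rooms,
        key=lambda room: score_text(build_query(objects, room)),
    )


# Stand-in for a language model (assumption, illustration only):
# a tiny table of object/room associations replaces learned common sense.
TOY_ASSOCIATIONS = {
    "kitchen": {"stove", "refrigerator", "sink"},
    "bathroom": {"toilet", "shower", "sink"},
    "bedroom": {"bed", "wardrobe", "nightstand"},
}


def toy_score(text: str) -> float:
    """Count how many objects in the sentence are associated with its room."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    room = next((r for r in TOY_ASSOCIATIONS if r in words), None)
    if room is None:
        return 0.0
    return float(len(TOY_ASSOCIATIONS[room] & words))


if __name__ == "__main__":
    rooms = ["kitchen", "bathroom", "bedroom"]
    print(label_room(["stove", "sink", "refrigerator"], rooms, toy_score))
```

Because the candidate labels are supplied as plain strings at query time, a sketch like this needs no retraining to add a new room or object label, which mirrors the generalization property the abstract describes.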
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
