Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
Huy Ha, Shuran Song

TL;DR
This paper introduces Semantic Abstraction (SemAbs), a framework that gives 2D vision-language models 3D spatial reasoning capabilities for open-world 3D scene understanding while preserving their zero-shot robustness, so that it generalizes to new vocabulary and real-world data.
Contribution
SemAbs extracts relevancy maps from CLIP and learns 3D spatial and geometric reasoning on top of them in a semantic-agnostic manner, extending 2D vision-language models to open-world 3D scene understanding tasks.
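To make the idea of reasoning in 3D on top of a 2D relevancy map more concrete, here is a minimal sketch of one plausible lifting step. It is not taken from the paper: it assumes a depth image aligned with the relevancy map, a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and a regular voxel grid, none of which this summary specifies. The function name `lift_relevancy_to_voxels` and all of its parameters are hypothetical.

```python
import numpy as np


def lift_relevancy_to_voxels(relevancy, depth, fx, fy, cx, cy,
                             grid_min, voxel_size, grid_shape):
    """Scatter a 2D relevancy map into a 3D voxel grid (illustrative only).

    Assumptions (not from the source): pinhole camera model, depth in the
    same units as voxel_size, non-negative relevancy scores, and a grid
    expressed in the camera frame.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel row (v) and column (u) indices
    z = depth.astype(np.float64)

    # Back-project every pixel to a 3D point in the camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    scores = relevancy.astype(np.float64).reshape(-1)

    # Convert points to integer voxel coordinates.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)

    # Keep pixels with valid depth that fall inside the grid.
    in_bounds = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    valid = (z.reshape(-1) > 0) & in_bounds

    # Each voxel keeps the maximum relevancy of the pixels that land in it.
    grid = np.zeros(grid_shape, dtype=np.float64)
    i, j, k = idx[valid].T
    np.maximum.at(grid, (i, j, k), scores[valid])
    return grid
```

The resulting grid carries only relevancy scores and geometry, with no class label attached, which is one way to read "semantic-agnostic": a downstream 3D model could consume it regardless of which text query produced the 2D map.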
Findings
SemAbs generalizes to unseen vocabulary and domains.
It effectively completes partially observed objects.
It localizes hidden objects from language descriptions.
Abstract
We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs - a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
