Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Huy Ha; Shuran Song

arXiv:2207.11514·cs.CV·December 7, 2022·21 cites

TL;DR

This paper introduces SemAbs, a framework that enhances 2D vision-language models with 3D spatial reasoning to improve open-world 3D scene understanding, enabling generalization to new vocabulary and real-world data.

Contribution

SemAbs is a novel approach that combines relevancy maps from CLIP with 3D reasoning to extend 2D models for open-world 3D scene understanding tasks.

Findings

01. SemAbs generalizes to unseen vocabulary and domains.

02. It effectively completes partially observed objects.

03. It localizes hidden objects from language descriptions.

Abstract

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs - a critical skill for robots to operate in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary,…
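The abstract describes building 3D reasoning on top of 2D relevancy maps. As a rough illustration of how a 2D map can be carried into 3D, the sketch below back-projects a per-pixel relevancy map through a depth image and pools it into a voxel grid. This is not the authors' pipeline: it assumes a pinhole camera, assumes the relevancy map has already been computed (for example from CLIP), and the function names `backproject_relevancy` and `voxelize_max`, the intrinsics, and the grid parameters are all hypothetical.

```python
import numpy as np


def backproject_relevancy(relevancy, depth, fx, fy, cx, cy):
    """Lift a per-pixel relevancy map to camera-frame 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row index, u: column index
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points, relevancy[valid]


def voxelize_max(points, values, origin, voxel_size, grid_shape):
    """Scatter point values into a voxel grid, keeping the max per voxel."""
    grid = np.zeros(grid_shape, dtype=np.float64)
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    vals = np.asarray(values, dtype=np.float64)[inside]
    # ufunc.at applies the maximum without buffering, so repeated
    # voxel indices are handled correctly.
    np.maximum.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), vals)
    return grid


# Toy input (hypothetical values): a flat wall 1 m away, one relevant pixel.
relevancy = np.zeros((4, 4))
relevancy[1, 2] = 0.9
depth = np.ones((4, 4))

points, values = backproject_relevancy(relevancy, depth, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
grid = voxelize_max(points, values, origin=(-1.0, -1.0, 0.0),
                    voxel_size=0.5, grid_shape=(4, 4, 4))
```

The resulting grid carries only relevancy values and geometry, not class labels, which is one way to read the "semantic-agnostic" phrasing in the abstract: a downstream 3D model could be trained on such grids without ever seeing the vocabulary itself.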

Code & Models

Repositories

columbia-ai-robotics/semantic-abstraction · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training