LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

Simon Boeder; Fabian Gigengack; Benjamin Risse

arXiv:2407.17310·cs.CV·July 26, 2024

TL;DR

LangOcc is a self-supervised, vision-language aligned 3D occupancy estimation method trained only on camera images. It detects arbitrary semantics and outperforms LiDAR-supervised methods in open vocabulary occupancy, without requiring explicit geometry supervision.

Contribution

The paper presents a novel open vocabulary occupancy estimation approach using vision-language alignment and differentiable volume rendering, trained solely on images without explicit geometry labels.
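To make the open vocabulary idea concrete, here is a minimal sketch of how language-aligned voxel features could be queried with text. It assumes each voxel already carries a feature vector in the same embedding space as the text prompts and that an occupancy mask is available. The function name `classify_voxels`, the cosine-similarity argmax rule, and the `-1` label for free space are illustrative assumptions, not the paper's confirmed inference procedure.

```python
import numpy as np

def classify_voxels(voxel_feats, text_embeds, occupied_mask):
    """Assign each occupied voxel the best-matching text prompt.

    voxel_feats:   (N, D) language-aligned feature per voxel (assumed given)
    text_embeds:   (C, D) one embedding per text prompt (assumed given,
                   e.g. produced by a text encoder such as CLIP's)
    occupied_mask: (N,) bool, True where the voxel is considered occupied
    Returns (N,) int labels; -1 marks free voxels (a convention chosen here).
    """
    eps = 1e-8  # avoids division by zero for all-zero vectors
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + eps)
    sim = v @ t.T                  # (N, C) cosine similarities
    labels = sim.argmax(axis=1)    # index of the most similar prompt
    return np.where(occupied_mask, labels, -1)
```

Because the class list is just a set of text embeddings, changing the vocabulary in this sketch means swapping `text_embeds`, with no retraining of the voxel features.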

Findings

01 Outperforms LiDAR-supervised methods in open vocabulary occupancy.

02 Achieves state-of-the-art results among self-supervised methods for semantic occupancy estimation on Occ3D-nuScenes.

03 Operates effectively without explicit geometry supervision.

Abstract

3D occupancy estimation has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, which limits their practicality and scalability. Moreover, most methods can only detect a predefined set of classes. In this work we present LangOcc, a novel approach for open vocabulary occupancy estimation that is trained only on camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space,…
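The rendering step mentioned in the abstract can be illustrated with a minimal sketch of standard volume rendering applied to feature vectors instead of colors. It assumes densities and features have already been sampled from the voxel grid at points along one camera ray. The function names, the sampling setup, and the cosine-distance loss against a 2D target feature are assumptions for illustration; the abstract does not specify the loss or sampling scheme.

```python
import numpy as np

def render_ray_features(density, features, deltas):
    """Alpha-composite per-sample features along a single ray.

    density:  (S,) non-negative density at each of S samples (near to far)
    features: (S, D) feature vector at each sample
    deltas:   (S,) spacing between consecutive samples
    Returns the rendered (D,) feature and the (S,) compositing weights.
    """
    alpha = 1.0 - np.exp(-density * deltas)   # opacity of each sample
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha                   # contribution of each sample
    rendered = weights @ features             # weighted sum over samples
    return rendered, weights

def cosine_loss(rendered, target):
    """1 - cosine similarity between a rendered and a target 2D feature.

    The choice of cosine distance is an assumption of this sketch.
    """
    denom = np.linalg.norm(rendered) * np.linalg.norm(target) + 1e-8
    return 1.0 - float(rendered @ target) / denom
```

In this sketch, a ray that hits an opaque sample renders that sample's feature almost exactly, so comparing rendered features with image-space features gives a training signal for both geometry (through `density`) and semantics (through `features`) without 3D labels.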

Taxonomy

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training