LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering
Simon Boeder, Fabian Gigengack, Benjamin Risse

TL;DR
LangOcc is a self-supervised, vision-language aligned 3D occupancy estimation method that detects arbitrary semantics from camera images alone. It outperforms LiDAR-supervised methods on open vocabulary occupancy without requiring explicit geometry supervision.
Contribution
The paper presents a novel open vocabulary occupancy estimation approach using vision-language alignment and differentiable volume rendering, trained solely on images without explicit geometry labels.
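To illustrate what vision-language alignment makes possible at query time, the sketch below labels voxels by comparing their features with text embeddings. It is a minimal illustration and not the authors' code: the function names (`normalize`, `classify_voxels`) are hypothetical, and the short toy vectors stand in for real CLIP text embeddings, which are not computed here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def classify_voxels(voxel_features, text_embeddings):
    """Assign each voxel the class name whose text embedding is most similar.

    voxel_features: list of feature vectors, one per voxel (hypothetical layout).
    text_embeddings: dict mapping a class name to its embedding. In practice these
    would come from a text encoder; here they are toy placeholders.
    """
    names = list(text_embeddings)
    text = [normalize(text_embeddings[name]) for name in names]
    labels = []
    for feat in voxel_features:
        f = normalize(feat)
        # Cosine similarity reduces to a dot product once both sides are unit length.
        scores = [sum(a * b for a, b in zip(f, t)) for t in text]
        labels.append(names[scores.index(max(scores))])
    return labels


# Toy example: two classes in a 3-dimensional embedding space.
queries = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0]}
voxels = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = classify_voxels(voxels, queries)
```

Because the class list is only an input to the query, it can be changed without retraining, which is the practical meaning of "open vocabulary" in this setting.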
Findings
Outperforms LiDAR-supervised methods in open vocabulary occupancy.
Achieves state-of-the-art results in self-supervised semantic occupancy estimation on Occ3D-nuScenes.
Operates effectively without explicit geometry supervision.
Abstract
The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods can detect only a predefined set of classes. In this work, we present LangOcc, a novel approach for open vocabulary occupancy estimation that is trained only on camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space,…
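The idea of rendering 3D estimates back to 2D can be sketched with the standard volume rendering quadrature for a single camera ray. This is a hedged illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the per-sample densities and features are assumed to have already been read from the voxel grid, and the cosine-distance loss is an assumed choice of objective.

```python
import math


def render_ray_feature(densities, features, deltas):
    """Alpha-composite per-sample features along one ray.

    densities: non-negative density at each sample along the ray.
    features:  feature vector at each sample (same length as densities).
    deltas:    distance covered by each sample interval.
    Returns the rendered feature and the per-sample compositing weights.
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered, weights


def feature_loss(rendered, target):
    """Cosine distance between a rendered feature and a 2D target feature.

    The target stands in for a feature from a 2D vision-language encoder;
    the choice of cosine distance is an assumption made for this sketch.
    """
    dot = sum(a * b for a, b in zip(rendered, target))
    norm_r = math.sqrt(sum(a * a for a in rendered))
    norm_t = math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / (norm_r * norm_t + 1e-8)


# Toy ray: empty space, then a nearly opaque sample, then an occluded sample.
rendered, weights = render_ray_feature(
    densities=[0.0, 50.0, 1.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    deltas=[1.0, 1.0, 1.0],
)
loss = feature_loss(rendered, [0.0, 2.0])
```

In the toy ray, the second sample absorbs almost all of the ray, so the rendered feature is dominated by that sample and the occluded third sample contributes almost nothing. Every step is differentiable, so a loss computed in 2D can send gradients back to both the densities (geometry) and the features (semantics) in 3D.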
Taxonomy
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
