Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
Aleksandar Jevtić, Christoph Reich, Felix Wimbauer, Oliver Hahn, Christian Rupprecht, Stefan Roth, Daniel Cremers

TL;DR
SceneDINO introduces an unsupervised method for semantic scene completion from a single image, leveraging self-supervised learning and multi-view consistency to infer 3D geometry and semantics without ground-truth annotations.
Contribution
The paper presents a feed-forward approach that adapts self-supervised representation learning to 3D scene understanding and reaches state-of-the-art segmentation accuracy among unsupervised methods, in both 3D and 2D.
Findings
State-of-the-art segmentation accuracy in unsupervised SSC
Linear probing matches supervised SSC performance
Strong domain generalization and multi-view consistency
Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy…
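The linear probing mentioned in the abstract can be illustrated with a minimal sketch. It assumes the general idea of a linear probe: the feature extractor stays frozen and only a single linear layer is trained on top of its outputs. The features below are synthetic stand-ins, not SceneDINO outputs, and the names train_linear_probe and predict are hypothetical helpers introduced for illustration. This is not the authors' evaluation protocol.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit one linear layer (softmax regression) on frozen features."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    """Return the class with the highest linear score for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)

# Synthetic stand-in for frozen per-point features (assumption, not real data):
# each class is a tight cluster around its own axis-aligned center.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
centers = 4.0 * np.eye(num_classes, dim)
labels = rng.integers(0, num_classes, size=300)
features = centers[labels] + 0.3 * rng.normal(size=(300, dim))

# Train the probe on the first 200 samples and evaluate on the held-out rest.
weights, bias = train_linear_probe(features[:200], labels[:200], num_classes)
accuracy = float((predict(features[200:], weights, bias) == labels[200:]).mean())
print(f"held-out probe accuracy: {accuracy:.3f}")
```

Because only the linear layer is trained, the probe's accuracy reflects how linearly separable the frozen features already are, which is why it is commonly used to assess representation quality.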
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
Methods: Vision Transformer · self-DIstillation with NO labels (DINO)
