Unified Semantic Transformer for 3D Scene Understanding
Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari

TL;DR
UNITE is a unified 3D scene understanding model that efficiently predicts multiple semantic attributes from RGB images, surpassing task-specific models in performance and operating fully end-to-end.
Contribution
The paper introduces UNITE, a novel feed-forward neural network that unifies diverse 3D semantic tasks within a single model, trained with self-supervision and multi-view consistency.
Findings
Achieves state-of-the-art results on multiple semantic tasks
Outperforms task-specific models and methods using ground truth 3D geometry
Operates efficiently in a fully end-to-end manner from RGB images
Abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view…
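As an illustration of the "2D distillation" idea mentioned in the abstract, the sketch below shows one common way such an objective can be written: a student's per-pixel features are pulled toward the features of a 2D teacher model using a cosine distance. This is a minimal sketch under stated assumptions, not the loss used by UNITE. The abstract does not specify the objective, and the function names `cosine_similarity` and `distillation_loss`, as well as the choice of cosine distance, are hypothetical.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + eps)


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Assumption: both inputs are lists of per-pixel feature vectors in the
    same order, one from the model being trained (student) and one from a
    frozen 2D teacher. This is an illustrative choice, not the paper's loss.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must have the same number of vectors")
    if not student_feats:
        raise ValueError("at least one feature vector is required")
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)
```

Under this formulation the loss is close to 0 when student and teacher features point in the same direction, regardless of magnitude, and close to 2 when they point in opposite directions.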
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
