Unified Semantic Transformer for 3D Scene Understanding

Sebastian Koch; Johanna Wald; Hidenobu Matsuki; Pedro Hermosilla; Timo Ropinski; Federico Tombari

arXiv:2512.14364 · cs.CV · December 19, 2025

TL;DR

UNITE is a single feed-forward model for 3D scene understanding that predicts multiple semantic attributes directly from RGB images, runs fully end-to-end in a few seconds per scene, and outperforms task-specific models.

Contribution

The paper introduces UNITE, a novel feed-forward neural network that unifies diverse 3D semantic tasks within a single model, trained with self-supervision and multi-view consistency.

Findings

01. Achieves state-of-the-art results on multiple semantic tasks
02. Outperforms task-specific models and methods using ground truth 3D geometry
03. Operates efficiently in a fully end-to-end manner from RGB images

Abstract

Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed for, and limited to, specific tasks. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding: a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and takes only a few seconds to infer the full 3D semantic geometry. Our approach directly predicts multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view…
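
Among the outputs the abstract lists are open-vocabulary features. As a minimal sketch of how per-point features of this kind are commonly queried, the example below assigns each point the label whose text embedding has the highest cosine similarity to the point's feature. This is an assumption about typical usage, not a description of UNITE's actual interface: the function names, the 3-dimensional toy vectors, and the label set are all hypothetical, and in practice the text embeddings would come from a text encoder that shares the feature space.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def label_points(point_features, text_embeddings):
    """Assign each point the label whose text embedding is most similar.

    point_features: list of per-point feature vectors
                    (hypothetical stand-in for a model's output).
    text_embeddings: dict mapping a label to its embedding vector
                     (hypothetical; assumed to live in the same space).
    """
    labels = []
    for feat in point_features:
        best = max(
            text_embeddings,
            key=lambda name: cosine_similarity(feat, text_embeddings[name]),
        )
        labels.append(best)
    return labels


# Toy data, made up purely for illustration.
toy_text = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
toy_points = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [2.0, 0.5, 0.0]]
predicted = label_points(toy_points, toy_text)
print(predicted)
```

Because the comparison uses cosine similarity, only the direction of a feature matters: the third toy point has a larger magnitude than the first but receives the same label.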

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Robotics and Sensor-Based Localization