Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Kadir Yilmaz; Adrian Kruse; Tristan Höfer; Daan de Geus; Bastian Leibe

arXiv:2604.19609·cs.CV·April 22, 2026

TL;DR

This paper introduces the Volume Transformer (Volt), a vanilla Transformer encoder adapted to 3D scene understanding with minimal modifications. It achieves state-of-the-art results in semantic and instance segmentation.

Contribution

The paper demonstrates that vanilla Transformers can be effectively adapted for 3D scene understanding with minimal changes, enabling scalable and general-purpose 3D perception models.

Findings

1. Volt achieves state-of-the-art results on multiple 3D datasets.

2. Data-efficient training with augmentations and distillation is crucial for Volt's performance.

3. Volt benefits more from increased supervision scale than domain-specific backbones.

Abstract

Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome…
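The abstract's pipeline (volumetric patch tokens, then positional information via a 3D extension of rotary embeddings) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `patchify`, `rope_1d`, and `rope_3d` are hypothetical, and the axial scheme used here (splitting the channels into three equal groups, one per spatial axis) is an assumption about what a 3D rotary extension might look like, since this page does not specify the details.

```python
import math
from collections import defaultdict


def patchify(points, patch_size):
    """Group 3D points into cubic patches keyed by integer patch index.

    Illustrative only: a real model would embed each patch into a token vector.
    """
    patches = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / patch_size),
               math.floor(y / patch_size),
               math.floor(z / patch_size))
        patches[key].append((x, y, z))
    return dict(patches)


def rope_1d(vec, pos, base=10000.0):
    """Standard rotary embedding: rotate consecutive channel pairs by pos * freq."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out


def rope_3d(vec, pos_xyz):
    """Assumed axial 3D extension: one third of the channels per spatial axis."""
    assert len(vec) % 6 == 0, "channel count must be divisible by 6"
    group = len(vec) // 3
    out = []
    for axis, pos in enumerate(pos_xyz):
        out += rope_1d(vec[axis * group:(axis + 1) * group], pos)
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Under this scheme, the dot product between a rotated query and a rotated key depends only on the offset between their patch positions, not on the absolute positions. That relative-position property is what makes rotary embeddings a natural fit for attention over patches.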

Code & Models

Models

🤗 KadirYilmaz/Volt (model)
