TL;DR
This paper introduces the Volume Transformer (Volt), a vanilla Transformer encoder adapted to 3D scene understanding with minimal modifications, which achieves state-of-the-art results in semantic and instance segmentation.
Contribution
The paper demonstrates that vanilla Transformers can be effectively adapted for 3D scene understanding with minimal changes, enabling scalable and general-purpose 3D perception models.
Findings
Volt achieves state-of-the-art results on multiple 3D datasets.
Data-efficient training with augmentations and distillation is crucial for Volt's performance.
Volt benefits more from increased supervision scale than domain-specific backbones do.
Abstract
Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome…
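The pipeline described in the abstract (volumetric patch tokens, full global self-attention, and a 3D extension of rotary positional embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the function names (`voxel_patch_ids`, `pool_tokens`, `rope_1d`, `rope_3d`, `global_attention`), the mean-pooling stand-in for a learned patch embedding, and the axial RoPE variant that splits the channels into three equal groups, one per spatial axis. The paper may tokenize and rotate differently.

```python
import numpy as np

def voxel_patch_ids(points, patch_size):
    """Assign each point (N, 3) to a cubic patch of side `patch_size`.

    Returns the integer grid coordinates of the occupied patches (T, 3)
    and, for every point, the index of the patch token it falls into (N,).
    """
    coords = np.floor(points / patch_size).astype(np.int64)
    coords -= coords.min(axis=0)  # shift so grid coordinates start at 0
    patches, token_of_point = np.unique(coords, axis=0, return_inverse=True)
    return patches, token_of_point.reshape(-1)

def pool_tokens(feats, token_of_point, n_tokens):
    """Mean-pool per-point features into one vector per patch.

    Assumption: a stand-in for whatever learned patch embedding is used.
    """
    sums = np.zeros((n_tokens, feats.shape[1]), dtype=feats.dtype)
    np.add.at(sums, token_of_point, feats)
    counts = np.bincount(token_of_point, minlength=n_tokens)
    return sums / counts[:, None]

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate channel pairs of x (T, d) by pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs[None, :]         # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, patch_coords):
    """Axial 3D RoPE (assumed variant): one third of the channels per axis."""
    d = x.shape[-1]
    assert d % 6 == 0, "need an even number of channels for each of 3 axes"
    c = d // 3
    parts = [
        rope_1d(x[:, i * c:(i + 1) * c], patch_coords[:, i].astype(x.dtype))
        for i in range(3)
    ]
    return np.concatenate(parts, axis=-1)

def global_attention(q, k, v):
    """Single-head full self-attention over all tokens (no windows, no sparsity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.uniform(0.0, 4.0, size=(500, 3))   # toy scene
    feats = rng.normal(size=(500, 12))              # toy per-point features
    patches, tok = voxel_patch_ids(points, patch_size=1.0)
    tokens = pool_tokens(feats, tok, len(patches))
    q = k = rope_3d(tokens, patches)                # positions enter via q and k only
    out = global_attention(q, k, tokens)
    print(out.shape)
```

Because each axis gets its own rotation, the dot product between a rotated query and a rotated key depends only on the offset between their patch coordinates, which is the property that makes rotary embeddings attractive for scenes of varying extent. In a full model the queries, keys, and values would come from learned projections and multiple heads; those are omitted here to keep the sketch short.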
