Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

TL;DR
Swin3D introduces a pretrained 3D transformer backbone for indoor scene understanding, achieving superior performance on segmentation and detection tasks by leveraging scalable self-attention and a novel positional embedding scheme.
Contribution
The paper presents a scalable 3D Swin transformer backbone pretrained on synthetic data, improving 3D scene understanding with novel positional embeddings and efficient self-attention.
Findings
Outperforms state-of-the-art on multiple 3D datasets
Pretrained on large synthetic dataset for better generalization
Efficient self-attention with linear memory complexity
Abstract
The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real…
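To make the idea of window-restricted self-attention on sparse voxels more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `window_ids` and `windowed_self_attention`, the `window_size` parameter, and the single-head form with no learned projections are all assumptions made for illustration, and the paper's contextual relative positional embedding is omitted entirely.

```python
import numpy as np

def window_ids(coords, window_size):
    """Assign each occupied voxel (integer coords, shape (N, 3)) to a window.

    Hypothetical helper: windows are non-overlapping cubes of side
    `window_size`, identified by a compact integer id.
    """
    cells = np.floor_divide(coords, window_size)
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    return inverse.reshape(-1)

def windowed_self_attention(feats, coords, window_size=4):
    """Single-head self-attention restricted to voxels in the same window.

    Simplification (assumption): queries, keys, and values are the raw
    features; a real backbone would use learned linear projections.
    """
    ids = window_ids(coords, window_size)
    out = np.zeros_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for w in np.unique(ids):
        idx = np.nonzero(ids == w)[0]          # voxels in this window
        x = feats[idx]                         # (k, C), k <= window_size**3
        logits = (x @ x.T) * scale             # (k, k) attention logits
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out
```

In this sketch each window holds at most `window_size**3` occupied voxels, so the per-window attention matrix has bounded size and only one is alive at a time; the memory used therefore grows linearly with the number of occupied voxels rather than quadratically. How Swin3D itself achieves its linear memory complexity is described in the paper, not here.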
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Advanced Neural Network Applications
Methods: Attention Is All You Need · Softmax · Stochastic Depth · Linear Layer · Layer Normalization · Dense Connections · Multi-Head Attention · Residual Connection · Swin Transformer
