Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang; Yu-Xiao Guo; Jian-Yu Xiong; Yang Liu; Hao Pan; Peng-Shuai Wang; Xin Tong; Baining Guo

arXiv:2304.06906·cs.CV·August 17, 2023·34 cites

TL;DR

Swin3D is a pretrained 3D Swin transformer backbone for indoor scene understanding. It combines self-attention on sparse voxels with linear memory complexity and a generalized contextual relative positional embedding, and it reports state-of-the-art results on downstream segmentation and detection tasks.

Contribution

The paper presents a scalable 3D Swin transformer backbone pretrained on the synthetic Structured3D dataset. Its two technical contributions are self-attention on sparse voxels with linear memory complexity and a generalized contextual relative positional embedding scheme.

Findings

01 Outperforms the state of the art on multiple 3D datasets

02 Pretrained on the synthetic Structured3D dataset, an order of magnitude larger than ScanNet, for better generalization

03 Efficient self-attention on sparse voxels with linear memory complexity

Abstract

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real…
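To make the idea of window-restricted self-attention on sparse voxels concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `window_attention`, the `window_size` parameter, and the use of identity query/key/value projections are all assumptions made for illustration, and it omits shifted windows, multiple heads, and the contextual relative positional embedding. It only shows why attention confined to occupied voxels inside fixed-size windows avoids a full N x N attention matrix.

```python
import numpy as np

def window_attention(coords, feats, window_size=4):
    """Illustrative self-attention over occupied voxels, restricted to 3D windows.

    coords: (N, 3) integer voxel coordinates of occupied voxels only.
    feats:  (N, C) float features, used directly as queries, keys and values
            (a simplification; a real layer would apply learned projections).
    """
    # Assign every voxel to a non-overlapping cubic window.
    windows = coords // window_size
    _, group = np.unique(windows, axis=0, return_inverse=True)
    group = group.reshape(-1)

    # Sort voxels by window so each window is one contiguous index chunk.
    order = np.argsort(group, kind="stable")
    boundaries = np.flatnonzero(np.diff(group[order])) + 1

    out = np.empty_like(feats)
    scale = 1.0 / np.sqrt(feats.shape[1])
    for idx in np.split(order, boundaries):
        x = feats[idx]                                # (m, C), m = voxels in window
        logits = (x @ x.T) * scale                    # only an m x m matrix
        logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out
```

Because each window holds at most `window_size ** 3` occupied voxels, the largest attention matrix in this sketch has a fixed size regardless of how many voxels the scene contains, so the working memory grows with the number of occupied voxels rather than with its square.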

Taxonomy

Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Advanced Neural Network Applications

Methods: Attention Is All You Need · Softmax · Stochastic Depth · Linear Layer · Layer Normalization · Dense Connections · Multi-Head Attention · Residual Connection · Swin Transformer