DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

Haiyang Wang; Chen Shi; Shaoshuai Shi; Meng Lei; Sen Wang; Di He; Bernt Schiele; Liwei Wang

arXiv:2301.06051·cs.CV·March 21, 2023·6 cites

TL;DR

DSVT introduces a novel sparse voxel transformer backbone for 3D perception that efficiently models long-range relationships in sparse point clouds, achieving state-of-the-art performance with real-time inference speed.

Contribution

The paper proposes Dynamic Sparse Voxel Transformer (DSVT), a new transformer-based method with dynamic sparse window attention and rotated set partitioning for efficient 3D perception.

Findings

01. Achieves state-of-the-art results on 3D perception tasks.

02. Supports real-time inference at 27 Hz with TensorRT.

03. Encodes geometric information effectively without custom CUDA operations.

Abstract

Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is more appropriate for flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard Transformer to sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow the…
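
The idea of splitting each window into local regions according to its sparsity, then attending within all regions at once, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (partition_into_sets, set_attention), the 2D coordinates, the simple chunking rule, and the single-head attention without learned projections are all assumptions made for illustration, and the paper's actual partitioning may differ.

```python
import numpy as np

def partition_into_sets(coords, window_size, set_size):
    """Group non-empty voxels into fixed-size sets, window by window.

    coords: (N, 2) integer voxel coordinates, N > 0 (hypothetical input layout).
    Returns (set_index, valid), both of shape (num_sets, set_size).
    set_index holds row indices into coords; padded slots hold 0 and are
    marked False in valid.
    """
    coords = np.asarray(coords)
    win = coords // window_size                       # window coordinate per voxel
    win_id = win[:, 0] * (win[:, 1].max() + 1) + win[:, 1]
    order = np.argsort(win_id, kind="stable")
    sets, masks = [], []
    for wid in np.unique(win_id):
        members = order[win_id[order] == wid]
        # Denser windows get more sets: ceil(len(members) / set_size).
        n_sets = -(-len(members) // set_size)
        padded = np.zeros(n_sets * set_size, dtype=np.int64)
        mask = np.zeros(n_sets * set_size, dtype=bool)
        padded[:len(members)] = members
        mask[:len(members)] = True
        sets.append(padded.reshape(n_sets, set_size))
        masks.append(mask.reshape(n_sets, set_size))
    return np.concatenate(sets), np.concatenate(masks)

def set_attention(features, set_index, valid):
    """Single-head self-attention inside every set, computed as one batch."""
    x = features[set_index]                            # (S, K, C)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = np.where(valid[:, None, :], scores, -np.inf)  # ignore padded keys
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros_like(features)
    out[set_index[valid]] = (weights @ x)[valid]       # scatter back to voxels
    return out
```

Because every set has the same padded size, the attention over all sets is a single batched matrix product, which is what makes this kind of scheme friendly to standard deep learning operators.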

Code & Models

Models

gntmky/mm3dtest (Hugging Face model)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing