DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, Liwei Wang

TL;DR
DSVT introduces a novel sparse voxel transformer backbone for 3D perception that efficiently models long-range relationships in sparse point clouds, achieving state-of-the-art performance with real-time inference speed.
Contribution
The paper proposes Dynamic Sparse Voxel Transformer (DSVT), a new transformer-based method with dynamic sparse window attention and rotated set partitioning for efficient 3D perception.
Findings
Achieves state-of-the-art results on 3D perception tasks.
Supports real-time inference at 27Hz with TensorRT.
Encodes geometric information effectively without custom CUDA operations.
Abstract
Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, because point clouds are sparse, applying a standard Transformer to sparse points is non-trivial. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To process sparse points efficiently in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow the…
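The partition-then-attend idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `partition_into_sets` and `set_attention`, the `set_size` parameter, the even splitting of each window's voxels, and the use of -1 as padding are all assumptions made for illustration, and the attention shown is a plain single-head version with no learned projections.

```python
import numpy as np

def partition_into_sets(coords, window_shape, set_size):
    """Group non-empty voxels into fixed-capacity sets, window by window.

    coords: (N, D) integer coordinates of non-empty voxels.
    window_shape: length-D window extent, in voxels.
    set_size: hypothetical cap on voxels per set.
    Returns a (num_sets, set_size) array of voxel indices, padded with -1.
    """
    coords = np.asarray(coords)
    win = coords // np.asarray(window_shape)        # window coordinate per voxel
    _, win_id = np.unique(win, axis=0, return_inverse=True)
    win_id = win_id.reshape(-1)                     # flat integer window ids
    order = np.argsort(win_id, kind="stable")       # voxels grouped by window
    counts = np.bincount(win_id)                    # voxels per window
    sets, start = [], 0
    for n in counts:
        members = order[start:start + n]
        start += n
        num_sets = int(-(-n // set_size))           # ceil(n / set_size): denser windows get more sets
        for chunk in np.array_split(members, num_sets):
            padded = np.full(set_size, -1, dtype=np.int64)
            padded[:len(chunk)] = chunk
            sets.append(padded)
    return np.stack(sets)

def set_attention(features, sets):
    """Masked self-attention over all sets in one batched operation."""
    mask = sets >= 0                                # (S, T) valid-slot mask
    x = features[np.where(mask, sets, 0)]           # (S, T, C); padded slots read voxel 0
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = np.where(mask[:, None, :], scores, -np.inf)  # ignore padded keys
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, mask                        # rows of padded queries should be discarded

# Toy 2D example: seven voxels, 4x4 windows, at most two voxels per set.
coords = np.array([[0, 0], [1, 1], [4, 4], [2, 3], [5, 6], [3, 0], [0, 2]])
features = np.arange(21, dtype=np.float64).reshape(7, 3)
sets = partition_into_sets(coords, window_shape=(4, 4), set_size=2)
out, mask = set_attention(features, sets)
```

Because every set has the same padded length, all sets stack into one dense tensor and a single batched matrix multiply processes them together, which is what makes this style of computation expressible with standard tensor operations.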
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing
