SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition
Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Zhiwu Lu, Jun He, Xiaoyong Du

TL;DR
SVT-Net is a super lightweight 3D point cloud model that effectively captures both local and long-range features for large scale place recognition, achieving state-of-the-art accuracy with minimal model size.
Contribution
The paper introduces SVT-Net, a novel lightweight network combining Atom-based and Cluster-based Sparse Voxel Transformers for improved place recognition.
Findings
Achieves state-of-the-art accuracy on benchmark datasets.
Runs at high speed with a model size of only 0.9M parameters.
Simplified versions reduce the size further to 0.8M and 0.4M parameters while preserving performance.
Abstract
Point cloud-based large scale place recognition is fundamental for many applications like Simultaneous Localization and Mapping (SLAM). Although many models have been proposed and have achieved good performance by learning short-range local features, long-range contextual properties have often been neglected. Moreover, the model size has also become a bottleneck for their wide applications. To overcome these challenges, we propose a super light-weight network model termed SVT-Net for large scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features in this model. Consisting of ASVT and CSVT, SVT-Net can achieve state-of-the-art on benchmark datasets in terms of…
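The abstract names two attention modules without giving their details here, so the following is only a rough sketch of one possible reading, not the authors' implementation. It assumes that "atom-based" means every non-empty voxel is treated as one attention token, and that "cluster-based" means voxels are softly pooled into a small number of cluster tokens that attend to each other before being broadcast back to the voxels. All function names, the soft-assignment scheme, and the sizes N, C, and K are hypothetical, and plain NumPy arrays stand in for the sparse convolution features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def atom_attention(feats, wq, wk, wv):
    """Self-attention with one token per non-empty voxel (assumed reading of ASVT)."""
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # (N, N)
    return feats + attn @ v  # residual connection

def cluster_attention(feats, w_assign, wq, wk, wv):
    """Attention over K pooled cluster tokens (assumed reading of CSVT)."""
    assign = softmax(feats @ w_assign, axis=-1)  # (N, K) soft assignment
    # Weighted mean of voxel features per cluster.
    clusters = (assign.T @ feats) / (assign.sum(axis=0)[:, None] + 1e-8)  # (K, C)
    q, k, v = clusters @ wq, clusters @ wk, clusters @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # (K, K)
    updated = attn @ v  # (K, C)
    return feats + assign @ updated  # broadcast cluster context back to voxels

# Hypothetical sizes: N non-empty voxels, C channels, K clusters.
rng = np.random.default_rng(0)
N, C, K = 50, 16, 4
feats = rng.standard_normal((N, C))

def make(*shape):
    # Small random weights standing in for learned parameters.
    return rng.standard_normal(shape) * 0.1

out_atom = atom_attention(feats, make(C, C), make(C, C), make(C, C))
out_cluster = cluster_attention(feats, make(C, K), make(C, C), make(C, C), make(C, C))
```

In this sketch the atom-level attention matrix grows as N × N, while the cluster-level one is only K × K, which illustrates why pooling into a few tokens is a cheap way to carry long-range context.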
Taxonomy
Topics: Robotics and Sensor-Based Localization · 3D Surveying and Cultural Heritage · Indoor and Outdoor Localization Technologies
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Adam · Layer Normalization · Residual Connection · Label Smoothing · Byte Pair Encoding · Dropout
