OcTr: Octree-based Transformer for 3D Object Detection
Chao Zhou, Yanan Zhang, Jiaxin Chen, Di Huang

TL;DR
OcTr introduces an octree-based Transformer for 3D object detection that efficiently captures global context and improves accuracy in large-scale LiDAR scenes, especially for distant and occluded objects.
Contribution
The paper proposes a novel octree-based Transformer architecture with a hybrid positional embedding for enhanced 3D object detection.
Findings
Achieves state-of-the-art results on Waymo and KITTI datasets.
Effectively balances accuracy and computational efficiency.
Captures rich global context through a coarse-to-fine octree structure.
Abstract
A key challenge for LiDAR-based 3D object detection is capturing sufficient features from large-scale 3D scenes, especially for distant and/or occluded objects. Despite recent efforts that apply Transformers for their long-sequence modeling capability, these methods fail to properly balance accuracy and efficiency, suffering from either inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid by conducting self-attention on the top level and then recursively propagating to the level below, restricted by the octants, which captures rich global context in a coarse-to-fine manner while keeping the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding,…
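The coarse-to-fine idea in the abstract can be illustrated with a small two-level sketch. This is not the authors' implementation: the function name `coarse_to_fine_attention`, the `top_k` selection rule, the use of raw features as queries, keys, and values (no learned projections), and the fixed two-level pyramid are all assumptions made for illustration. It is written in plain Python so it runs without dependencies.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine_attention(coarse, fine, top_k=2):
    """Two-level coarse-to-fine attention (illustrative sketch only).

    coarse[i] : feature vector of coarse cell i (top pyramid level).
    fine[i]   : list of 8 feature vectors, the octants (children) of cell i.
    Returns (refined fine features, kept coarse indices per query cell).
    """
    scale = 1.0 / math.sqrt(len(coarse[0]))
    refined, kept = [], []
    for i, query in enumerate(coarse):
        # Top level: score the query cell against every coarse cell (global view).
        scores = [dot(query, key) * scale for key in coarse]
        # ASSUMPTION: keep only the top_k highest-scoring coarse cells.
        keep = sorted(range(len(coarse)), key=lambda j: -scores[j])[:top_k]
        kept.append(keep)
        # Level below: the query's octants attend only to octants of kept cells.
        keys = [child for j in keep for child in fine[j]]
        cell_out = []
        for child_query in fine[i]:
            weights = softmax([dot(child_query, k) * scale for k in keys])
            cell_out.append([
                sum(w * k[d] for w, k in zip(weights, keys))
                for d in range(len(child_query))
            ])
        refined.append(cell_out)
    return refined, kept
```

Under these assumptions, each fine-level query compares against `8 * top_k` keys rather than all `8 * len(coarse)` fine tokens, which is one way to read "restricted by the octants" while still letting the top level see the whole scene.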
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout · Layer Normalization · Dense Connections
