Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries

Qi Song; Qingyong Hu; Chi Zhang; Yongquan Chen; Rui Huang

arXiv:2408.06901·cs.CV·August 14, 2024

TL;DR

This paper introduces SDTR, an input-aware Transformer framework that uses semantic and depth priors to improve multi-camera 3D perception tasks such as 3D object detection and BEV segmentation, achieving state-of-the-art results on the nuScenes and Lyft benchmarks.

Contribution

The paper proposes a novel S-D Encoder and Prior-guided Query Builder that explicitly model and incorporate semantic and depth priors, enhancing Transformer-based 3D perception.

Findings

01

Achieves state-of-the-art performance on nuScenes and Lyft benchmarks.

02

Effectively models semantic and depth priors to improve 3D perception accuracy.

03

Enhances input-awareness of queries for better learning capacity.

Abstract

3D perception tasks, such as 3D object detection and Bird's-Eye-View (BEV) segmentation using multi-camera images, have drawn significant attention recently. Although accurately estimating both semantic and 3D scene layouts is crucial for these tasks, existing techniques often neglect the synergistic effects of semantic and depth cues, leading to classification and position estimation errors. Additionally, the input-independent nature of initial queries limits the learning capacity of Transformer-based models. To tackle these challenges, we propose an input-aware Transformer framework that leverages Semantics and Depth as priors (named SDTR). Our approach uses an S-D Encoder that explicitly models semantic and depth priors, thereby disentangling the learning of object categorization and position estimation. Moreover, we introduce…
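
To make the abstract's distinction between input-independent and input-dependent queries concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not SDTR's actual query construction, which the truncated abstract does not describe. The function names, the top-k selection rule, and the use of a per-location score as a stand-in for a semantic prior are all hypothetical.

```python
import numpy as np


def static_queries(num_queries, dim, seed=0):
    """Input-independent queries: one fixed table reused for every image.

    In a real model this table would be a learned embedding; a seeded
    random table stands in for it here.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_queries, dim))


def input_dependent_queries(features, scores, num_queries):
    """Hypothetical input-dependent queries built from the current input.

    features: (N, dim) array of flattened image features.
    scores:   (N,) per-location confidence, an assumed stand-in for a prior.
    Returns the top-scoring feature vectors and their indices.
    """
    # Sort locations by descending score and keep the first num_queries.
    top = np.argsort(scores)[::-1][:num_queries]
    return features[top], top


# Toy input: 6 locations with 2-dimensional features.
features = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])

queries, indices = input_dependent_queries(features, scores, num_queries=2)
print(indices)  # locations 1 and 3 have the highest scores
print(queries)
```

The static table is identical for every image, whereas the second function returns different queries whenever the features or scores change, which is the property the abstract refers to as input-awareness.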

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Advanced Vision and Imaging · Visual Attention and Saliency Detection

Methods: Linear Layer · Layer Normalization · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections