# Introducing Depth into Transformer-based 3D Object Detection

**Authors:** Hao Zhang, Hongyang Li, Ailing Zeng, Feng Li, Shilong Liu, Xingyu Liao, Lei Zhang

arXiv: 2302.13002 · 2023-06-06

## TL;DR

This paper introduces DAT, a depth-aware transformer framework for camera-based 3D detection that reduces depth errors and duplicate predictions by integrating depth information into attention mechanisms and auxiliary loss functions.

## Contribution

The paper proposes a Depth-Aware Spatial Cross-Attention (DA-SCA) module and a Depth-aware Negative Suppression (DNS) auxiliary loss that incorporate depth into transformer-based 3D detection, improving the accuracy of BEVFormer, DETR3D, and PETR.

## Key findings

- DAT improves BEVFormer by +2.8 NDS on nuScenes val under the same settings
- Achieves 60.0 NDS and 51.5 mAP on nuScenes test with a pre-trained VoVNet-99 backbone
- Enhances performance of BEVFormer, DETR3D, and PETR models

## Abstract

In this paper, we present DAT, a Depth-Aware Transformer framework designed for camera-based 3D detection. Our model is motivated by two major issues observed in existing methods: large depth translation errors and duplicate predictions along the depth axis. To mitigate these issues, we propose two key solutions within DAT. To address the first issue, we introduce a Depth-Aware Spatial Cross-Attention (DA-SCA) module that incorporates depth information into spatial cross-attention when lifting image features to 3D space. To address the second issue, we introduce an auxiliary learning task called Depth-aware Negative Suppression (DNS) loss. First, based on their reference points, we organize features as a Bird's-Eye-View (BEV) feature map. Then, we sample positive and negative features along each object ray that connects an object and a camera and train the model to distinguish between them. The proposed DA-SCA and DNS methods effectively alleviate these two problems. We show that DAT is a versatile method that enhances the performance of all three popular models, BEVFormer, DETR3D, and PETR. Our evaluation on BEVFormer demonstrates that DAT achieves a significant improvement of +2.8 NDS on nuScenes val under the same settings. Moreover, when using pre-trained VoVNet-99 as the backbone, DAT achieves strong results of 60.0 NDS and 51.5 mAP on nuScenes test. Our code will be released soon.
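
The ray-sampling step behind the DNS auxiliary task can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_along_ray`, the 2D BEV coordinates, and the choice of offsets are all hypothetical, and the summary does not specify how the paper picks sample positions or formulates the loss. The sketch only shows the geometry: points on the line through the camera and the object, with the object's own location marked as positive and displaced points along the same ray marked as negative.

```python
import numpy as np


def sample_along_ray(cam_xy, obj_xy, offsets):
    """Sample BEV points on the ray from a camera through an object centre.

    Hypothetical helper for illustration only. `offsets` are signed distances
    from the object centre along the ray; an offset of 0 is the object itself
    (label 1), any other offset is treated as a negative sample (label 0).
    """
    cam_xy = np.asarray(cam_xy, dtype=float)
    obj_xy = np.asarray(obj_xy, dtype=float)
    offsets = np.asarray(offsets, dtype=float)

    direction = obj_xy - cam_xy
    dist = np.linalg.norm(direction)
    if dist == 0:
        raise ValueError("camera and object positions coincide")
    unit = direction / dist  # unit vector pointing from camera to object

    # Shift the object centre by each offset along the ray direction.
    points = obj_xy + offsets[:, None] * unit
    labels = (offsets == 0).astype(int)
    return points, labels


# Assumed example values: camera at the origin, object at (3, 4).
points, labels = sample_along_ray((0.0, 0.0), (3.0, 4.0), [-5.0, 0.0, 5.0])
```

In a full pipeline, the returned points would presumably be used to index the BEV feature map, and a classifier would be trained to separate the positive feature from the negatives, which is what discourages duplicate predictions at different depths on the same ray.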

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13002/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13002/full.md

## References

35 references — full list in the complete paper: https://tomesphere.com/paper/2302.13002/full.md

---
Source: https://tomesphere.com/paper/2302.13002