DenseBEV: Transforming BEV Grid Cells into 3D Objects
Marius Dähling, Sebastian Krebs, J. Marius Zöllner

TL;DR
DenseBEV introduces a novel end-to-end approach for 3D object detection using BEV feature cells as anchors, improving detection accuracy especially for small objects and achieving state-of-the-art results on major datasets.
Contribution
The paper proposes using BEV feature cells directly as anchors and incorporates a hybrid temporal modeling approach, enhancing efficiency and detection performance in multi-camera 3D object detection.
Findings
Significant improvements in NDS and mAP on nuScenes.
Enhanced pedestrian detection, with a 3.8% mAP increase.
State-of-the-art performance on the Waymo dataset, with 60.7% LET-mAP.
Abstract
In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient…
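To make the idea of suppressing a dense grid of BEV cells more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the BEV-based Non-Maximum Suppression is carried out, so this sketch swaps in a simple local-maximum filter (a cell survives only if it has the highest score in its neighbourhood) followed by a top-k selection. The function name `bev_grid_nms` and the parameters `window`, `score_thresh`, and `top_k` are hypothetical, and the gradient handling mentioned in the abstract is not modelled here.

```python
import numpy as np


def bev_grid_nms(scores, window=3, score_thresh=0.1, top_k=5):
    """Illustrative stand-in for NMS over a BEV grid (assumed, not from the paper).

    scores: 2D array of per-cell objectness scores, shape (H, W).
    Returns the (row, col) indices of surviving cells, highest score first.
    """
    scores = np.asarray(scores, dtype=float)
    h, w = scores.shape
    pad = window // 2
    # Pad with -inf so border cells only compete with real neighbours.
    padded = np.pad(scores, pad, mode="constant", constant_values=-np.inf)

    # Sliding-window maximum built from shifted views of the padded grid.
    local_max = np.full_like(scores, -np.inf)
    for dy in range(window):
        for dx in range(window):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])

    # A cell survives if it is the maximum of its window and scores above the threshold.
    keep = (scores >= local_max) & (scores > score_thresh)
    rows, cols = np.nonzero(keep)

    # Keep at most top_k survivors, ordered by descending score.
    order = np.argsort(-scores[rows, cols], kind="stable")[:top_k]
    return [(int(rows[i]), int(cols[i])) for i in order]


# Toy 4x4 grid: the cell at (0, 1) is suppressed by its stronger neighbour (0, 0).
grid = np.zeros((4, 4))
grid[0, 0], grid[0, 1], grid[3, 3] = 0.9, 0.8, 0.7
print(bev_grid_nms(grid))  # [(0, 0), (3, 3)]
```

In this sketch only the surviving cells would be passed on as object queries, which is one way to keep the number of queries entering attention small.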
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
