BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu

TL;DR
BEVNeXt revitalizes dense BEV frameworks for 3D object detection by integrating CRF-based depth estimation, temporal aggregation, and a two-stage decoder, achieving state-of-the-art results on nuScenes.
Contribution
The paper introduces BEVNeXt, a modernized dense BEV framework with novel components that enhance depth estimation and object localization, surpassing existing methods.
Findings
Achieves 64.2 NDS on nuScenes test set.
Outperforms both dense BEV and query-based frameworks.
Demonstrates superior depth estimation and localization accuracy.
Abstract
Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques
MethodsMulti-Head Attention · Linear Layer · Attention Is All You Need · Absolute Position Encodings · Dropout · Dense Connections · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer
