BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

TL;DR
This paper introduces a novel bird's-eye-view detector with perspective supervision that improves convergence and compatibility with modern image backbones, achieving state-of-the-art results on nuScenes.
Contribution
It proposes a two-stage BEV detector with perspective supervision, enabling better integration of modern image backbones and faster, more effective training.
Findings
Achieves new state-of-the-art results on the nuScenes dataset.
Compatible with a wide range of image backbones.
Faster convergence and improved performance.
Abstract
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on…
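The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Proposal`, `select_proposals`, `build_bev_queries`, and `total_loss`, the score threshold, the proposal cap, and the loss weights are all hypothetical. It assumes the perspective head emits scored detections whose centers have already been projected onto the BEV plane, and that perspective supervision enters training as an extra loss term added to the BEV loss.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """Hypothetical first-stage detection from the perspective head."""
    center_xy: Tuple[float, float]  # assumed to be already projected onto the BEV plane
    score: float


def select_proposals(proposals: List[Proposal],
                     score_threshold: float = 0.3,
                     max_proposals: int = 2) -> List[Proposal]:
    """Keep the highest-scoring proposals (threshold and cap are assumed values)."""
    kept = [p for p in proposals if p.score >= score_threshold]
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept[:max_proposals]


def build_bev_queries(learned_refs: List[Tuple[float, float]],
                      proposals: List[Proposal]) -> List[Tuple[float, float]]:
    """Feed first-stage proposals to the BEV head alongside its learned reference points."""
    return list(learned_refs) + [p.center_xy for p in proposals]


def total_loss(bev_loss: float, perspective_loss: float,
               bev_weight: float = 1.0, perspective_weight: float = 1.0) -> float:
    """Combine BEV and perspective supervision; the weights are assumptions."""
    return bev_weight * bev_loss + perspective_weight * perspective_loss


proposals = [Proposal((1.0, 2.0), 0.9),
             Proposal((3.0, 4.0), 0.1),
             Proposal((5.0, 6.0), 0.5)]
kept = select_proposals(proposals)
queries = build_bev_queries([(0.0, 0.0)], kept)
loss = total_loss(bev_loss=2.0, perspective_loss=0.5)
```

In this sketch the low-scoring proposal is dropped before the second stage, so the BEV head only receives reference points that the perspective head is reasonably confident about, while the perspective loss gives the image backbone a direct training signal.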
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning
Methods: Batch Normalization · Max Pooling · 1x1 Convolution · Concatenated Skip Connection · Convolution · One-Shot Aggregation · VoVNet
