BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, Jiwen Lu

TL;DR
BEVerse is a unified multi-task framework that leverages spatio-temporal BEV representations from multi-camera videos to improve perception and prediction in autonomous driving, outperforming single-task methods.
Contribution
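The paper introduces BEVerse, a unified framework that jointly performs perception and prediction from multi-camera BEV representations. Its task decoders add two modules: a grid sampler that produces BEV features at task-specific ranges and granularities, and an iterative flow method for future prediction.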
Findings
Outperforms existing single-task methods on nuScenes
Improves 3D object detection and semantic map construction
Enhances motion prediction accuracy
Abstract
In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies focusing on the improvement of single-task approaches, BEVerse features in producing spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasoning about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. After the ego-motion alignment, the spatio-temporal encoder is utilized for further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose the grid sampler to generate BEV features with different ranges and granularities for different tasks. Also, we design the method of…
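Two steps in the abstract, ego-motion alignment and the grid sampler, both come down to resampling a BEV feature map on a new grid. The sketch below illustrates that idea with plain NumPy bilinear interpolation. It is not the authors' implementation: the function names, the square ego-centred grid convention, and the `cur_to_prev` matrix (a 3x3 homogeneous transform from current-frame to previous-frame metric coordinates) are all assumptions made for illustration.

```python
import numpy as np

def cell_centers(bev_range, n):
    """Metric coordinates of n cell centres spanning [-bev_range, bev_range]."""
    res = 2.0 * bev_range / n
    return -bev_range + (np.arange(n) + 0.5) * res

def bilinear_sample(feat, xs, ys):
    """Sample feat (C, H, W) at fractional pixel coords; zeros outside the map."""
    C, H, W = feat.shape
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    wx, wy = xs - x0, ys - y0
    out = np.zeros((C,) + xs.shape, dtype=float)
    corners = (
        (0, 0, (1 - wy) * (1 - wx)),
        (0, 1, (1 - wy) * wx),
        (1, 0, wy * (1 - wx)),
        (1, 1, wy * wx),
    )
    for dy, dx, w in corners:
        yi, xi = y0 + dy, x0 + dx
        valid = (yi >= 0) & (yi < H) & (xi >= 0) & (xi < W)
        vals = feat[:, np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
        out += vals * (w * valid)
    return out

def metric_to_pixel(m, bev_range, n):
    """Convert metric coordinates to fractional pixel indices of an n-cell axis."""
    res = 2.0 * bev_range / n
    return (m + bev_range) / res - 0.5

def grid_sample_bev(feat, src_range, dst_range, dst_res):
    """Hypothetical grid sampler: re-grid a square BEV map to a new range/resolution."""
    _, H, W = feat.shape
    n = int(round(2.0 * dst_range / dst_res))
    centers = cell_centers(dst_range, n)
    mx, my = np.meshgrid(centers, centers)  # mx varies along columns
    xs = metric_to_pixel(mx, src_range, W)
    ys = metric_to_pixel(my, src_range, H)
    return bilinear_sample(feat, xs, ys)

def warp_to_current(prev_feat, bev_range, cur_to_prev):
    """Hypothetical ego-motion alignment: warp a past BEV map into the current frame."""
    _, H, W = prev_feat.shape
    mx, my = np.meshgrid(cell_centers(bev_range, W), cell_centers(bev_range, H))
    pts = np.stack([mx, my, np.ones_like(mx)])           # (3, H, W)
    px, py, _ = np.einsum("ij,jhw->ihw", cur_to_prev, pts)
    xs = metric_to_pixel(px, bev_range, W)
    ys = metric_to_pixel(py, bev_range, H)
    return bilinear_sample(prev_feat, xs, ys)
```

For example, with an 8x8 map covering ±4 m at 1 m per cell, `grid_sample_bev(feat, 4.0, 2.0, 0.5)` returns an 8x8 map of the central ±2 m at 0.5 m per cell, which is the kind of range and granularity change the grid sampler is described as providing per task. A production version would more likely use a framework's differentiable sampling operator so that gradients flow through the resampling.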
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods
