BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Zhiqi Li; Wenhai Wang; Hongyang Li; Enze Xie; Chonghao Sima; Tong Lu; Qiao Yu; Jifeng Dai

arXiv:2203.17270·cs.CV·July 14, 2022·38 cites

Open Access 3 Repos 1 Models

TL;DR

BEVFormer introduces a novel spatiotemporal transformer framework that learns unified bird's-eye-view representations from multi-camera images, significantly improving 3D perception accuracy for autonomous driving tasks.

Contribution

BEVFormer is the first to integrate spatial and temporal attention mechanisms in a unified BEV framework for multi-camera perception.

Findings

01 Achieves 56.9% NDS on nuScenes test set, surpassing previous methods.

02 Improves velocity estimation accuracy and object recall in low visibility conditions.

03 Performs on par with LiDAR-based methods without using LiDAR data.
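For readers unfamiliar with the NDS figure quoted above, the following is a minimal sketch of how the nuScenes Detection Score combines mean average precision with the five mean true-positive error metrics, as defined by the nuScenes detection benchmark. The function name `nds` and the input values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean true-positive errors.

    NDS = (5 * mAP + sum(1 - min(1, err) for each TP error)) / 10
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP error metrics")
    # Each error is clamped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    # mAP carries half of the total weight; the five TP scores share the rest.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs for illustration only; not numbers reported in the paper.
example = nds(
    0.40,
    {"mATE": 0.70, "mASE": 0.30, "mAOE": 0.50, "mAVE": 0.90, "mAAE": 0.20},
)
```

Because velocity error (mAVE) is one of the five terms, better velocity estimation raises NDS even when mAP is unchanged.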

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state-of-the-art 56.9% in terms of the NDS metric on the nuScenes test set, which is…
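The data flow described in the abstract can be illustrated with a toy sketch: a fixed grid of BEV queries is first fused with the previous frame's BEV features, then gathers features from all camera views. This is not the authors' implementation. It assumes plain scaled dot-product attention as a stand-in for the attention layers, attends to every camera token instead of only the regions of interest, and uses hypothetical names (`attend`, `bev_step`) and toy tensor sizes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Plain scaled dot-product attention (a simplification, see lead-in)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values


def bev_step(bev_queries, prev_bev, camera_feats):
    """One simplified step: temporal fusion, then spatial aggregation."""
    # Temporal self-attention: queries attend to themselves and, when
    # available, to the BEV features produced for the previous frame.
    if prev_bev is None:
        history = bev_queries
    else:
        history = np.concatenate([bev_queries, prev_bev], axis=0)
    q = bev_queries + attend(bev_queries, history)
    # Spatial cross-attention: each BEV query gathers features from the
    # tokens of all camera views (flattened into one token list here).
    cam_tokens = camera_feats.reshape(-1, camera_feats.shape[-1])
    return q + attend(q, cam_tokens)


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8            # toy BEV grid size and channel width (hypothetical)
n_cams, n_tokens = 6, 10     # toy camera count and tokens per view (hypothetical)
bev_queries = rng.normal(size=(H * W, C))  # grid-shaped queries, flattened

prev_bev = None
for _ in range(3):           # three consecutive frames of random features
    camera_feats = rng.normal(size=(n_cams, n_tokens, C))
    prev_bev = bev_step(bev_queries, prev_bev, camera_feats)

bev = prev_bev.reshape(H, W, C)  # BEV feature map for the latest frame
```

The loop shows the recurrent part: the BEV output of one frame becomes the history input of the next, so information from earlier frames is carried forward without storing every past frame.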

Code & Models

Models

AXERA-TECH/bevformer (model, 10 downloads)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Robotics and Sensor-Based Localization