OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

Zhangyang Qi; Jiaqi Wang; Xiaoyang Wu; Hengshuang Zhao

arXiv:2306.01738 · cs.CV · June 5, 2023 · 2 cites

TL;DR

OCBEV introduces an object-centric BEV transformer that enhances multi-view 3D detection by effectively modeling moving objects, leading to state-of-the-art results and faster training convergence on nuScenes.

Contribution

The paper proposes OCBEV, a novel object-centric BEV transformer with three key designs for improved temporal and spatial modeling of moving objects in 3D detection.

Findings

01

Achieves a state-of-the-art result on nuScenes, surpassing BEVFormer by 1.5 NDS points

02

Converges faster, needing only half the training iterations to reach comparable performance

03

Outperforms traditional BEVFormer in accuracy and efficiency

Abstract

Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both BEV's strong perception power and an end-to-end pipeline. Despite substantial progress, existing works model objects by globally leveraging the temporal and spatial information of BEV features, which causes problems in complex and dynamic autonomous driving scenarios. In this paper, we propose OCBEV, an Object-Centric query-BEV detector that captures the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs: Object Aligned Temporal Fusion aligns the BEV feature based on ego-motion and the estimated current locations of moving objects, leading to a precise instance-level feature…
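The ego-motion part of the alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align_prev_bev`, the grid layout (ego vehicle at the grid center, rows along y, columns along x), the nearest-neighbor sampling, and the transform convention (`p_prev = R(yaw) @ p_cur + t`) are all assumptions made for illustration. The object-level step, which uses estimated current object locations, is not sketched here because the abstract above is cut off before describing it.

```python
import numpy as np

def align_prev_bev(prev_bev, yaw, t, cell_size):
    """Warp a previous-frame BEV feature map into the current ego frame.

    Hypothetical helper for illustration only.
    prev_bev : array of shape (C, H, W), ego vehicle at the grid center.
    yaw, t   : assumed rigid transform taking a point in the current ego
               frame to the previous ego frame, p_prev = R(yaw) @ p_cur + t,
               with t = (tx, ty) in meters.
    cell_size: edge length of one BEV cell in meters.
    """
    C, H, W = prev_bev.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Metric coordinates of every cell center in the current ego frame.
    x = (jj - (W - 1) / 2) * cell_size
    y = (ii - (H - 1) / 2) * cell_size

    # Express the same points in the previous ego frame.
    c, s = np.cos(yaw), np.sin(yaw)
    xp = c * x - s * y + t[0]
    yp = s * x + c * y + t[1]

    # Back to (row, col) indices of the previous grid, nearest neighbor.
    jp = np.rint(xp / cell_size + (W - 1) / 2).astype(int)
    ip = np.rint(yp / cell_size + (H - 1) / 2).astype(int)

    # Cells that fall outside the previous grid stay zero.
    valid = (ip >= 0) & (ip < H) & (jp >= 0) & (jp < W)
    out = np.zeros_like(prev_bev)
    out[:, valid] = prev_bev[:, ip[valid], jp[valid]]
    return out
```

With `yaw = 0` and `t = (cell_size, 0)`, each current cell reads the previous cell one column to its right, and the last column is zero-filled. A practical detector would more likely use bilinear sampling so the warp stays differentiable; nearest-neighbor is used here only to keep the sketch short.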

Taxonomy

Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings