BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
Junjie Huang, Guan Huang

TL;DR
BEVDet4D introduces a temporal-aware extension to the BEVDet framework, leveraging multi-frame data to significantly improve 3D object detection accuracy and velocity estimation in multi-camera systems.
Contribution
The paper proposes BEVDet4D, a novel 4D spatial-temporal framework that enhances multi-camera 3D detection by incorporating temporal cues with minimal additional computation.
Findings
Reduces velocity error by up to 62.9% relative to the single-frame BEVDet baseline.
Achieves 54.5% NDS (nuScenes Detection Score) on the nuScenes benchmark, surpassing previous methods.
Makes vision-based velocity estimation comparable to that of LiDAR- or radar-based methods.
Abstract
Single frame data contains finite information which limits the performance of the existing vision-based multi-camera 3D object detection paradigms. For fundamentally pushing the performance boundary in this area, a novel paradigm dubbed BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the naive BEVDet framework with a few modifications just for fusing the feature from the previous frame with the corresponding one in the current frame. In this way, with negligible additional computing budget, we enable BEVDet4D to access the temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the task of velocity prediction by removing the factors of ego-motion and time in the learning target. As a result, BEVDet4D with robust generalization performance reduces the velocity error…
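The two ideas in the abstract, fusing the previous frame's feature with the current one and removing ego-motion and time from the velocity target, can be illustrated with a small NumPy sketch. This is a hedged illustration, not the authors' implementation. The function names (`align_prev_bev`, `fuse_bev`, `translation_target`, `velocity_from_translation`), the nearest-neighbour resampling, the grid layout with the ego vehicle at the grid centre, and the channel-wise concatenation are all assumptions made for the example.

```python
import numpy as np


def align_prev_bev(prev_feat, yaw, shift_xy, cell_size):
    """Resample a previous-frame BEV feature (C, H, W) into the current ego frame.

    Assumed convention (hypothetical): a point p_prev in the previous ego frame
    maps to p_cur = R(yaw) @ p_prev + shift_xy in the current ego frame.
    Nearest-neighbour lookup is used for brevity; cells with no source are zero.
    """
    C, H, W = prev_feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Metric coordinates of each current-frame cell centre (ego at grid centre).
    x_cur = (xs - (W - 1) / 2) * cell_size
    y_cur = (ys - (H - 1) / 2) * cell_size
    # Invert the rigid transform: p_prev = R(yaw)^T @ (p_cur - shift_xy).
    dx, dy = x_cur - shift_xy[0], y_cur - shift_xy[1]
    c, s = np.cos(yaw), np.sin(yaw)
    x_prev = c * dx + s * dy
    y_prev = -s * dx + c * dy
    # Convert back to integer grid indices in the previous-frame feature map.
    j = np.rint(x_prev / cell_size + (W - 1) / 2).astype(int)
    i = np.rint(y_prev / cell_size + (H - 1) / 2).astype(int)
    valid = (i >= 0) & (i < H) & (j >= 0) & (j < W)
    out = np.zeros_like(prev_feat)
    out[:, valid] = prev_feat[:, i[valid], j[valid]]
    return out


def fuse_bev(cur_feat, aligned_prev_feat):
    """Fuse the two frames by channel-wise concatenation (assumed fusion op)."""
    return np.concatenate([cur_feat, aligned_prev_feat], axis=0)


def translation_target(pos_cur, pos_prev_in_cur):
    """Learning target without ego-motion or time: the object's displacement
    between the two frames, with both positions in the current ego frame."""
    return np.asarray(pos_cur, dtype=float) - np.asarray(pos_prev_in_cur, dtype=float)


def velocity_from_translation(translation, dt):
    """Recover velocity at inference by dividing the displacement by the
    time gap dt (seconds) between the two frames."""
    return np.asarray(translation, dtype=float) / dt


# Usage: ego moved one 0.5 m cell along x between frames, no rotation.
prev = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
cur = np.ones((2, 4, 4))
aligned = align_prev_bev(prev, yaw=0.0, shift_xy=(0.5, 0.0), cell_size=0.5)
fused = fuse_bev(cur, aligned)
print(fused.shape)  # (4, 4, 4)

# An object displaced by 1 m along x over 0.5 s.
t = translation_target((10.0, 2.0), (9.0, 2.0))
print(velocity_from_translation(t, dt=0.5))
```

Because both object positions are expressed in the same ego frame, the displacement target does not depend on how the ego vehicle moved, and dividing by the frame interval only at the end keeps the time gap out of what the network has to learn.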
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Advanced Optical Sensing Technologies
