OnlineBEV: Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception
Junho Koh, Youngwoo Lee, Jungho Kim, Dongyoung Lee, Jun Won Choi

TL;DR
OnlineBEV fuses BEV features over time with a recurrent structure, using motion-guided alignment and consistency learning to improve multi-camera 3D perception for autonomous driving.
Contribution
The paper proposes a recurrent fusion method with motion-guided alignment and a consistency loss, which aggregates BEV features across frames for 3D perception.
Findings
Achieves 63.9% NDS on nuScenes, surpassing previous methods.
Employs motion-guided BEV fusion for accurate temporal alignment.
Demonstrates state-of-the-art results in camera-only 3D detection.
Abstract
Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the…
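The recurrent structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain running average for the paper's fusion module and an integer grid translation for ego-motion compensation, and it ignores object motion entirely. The names `shift_bev`, `recurrent_fuse`, `ego_shift`, and `alpha` are hypothetical. The point is only that a single carried state stands in for the whole frame history, so memory does not grow with the number of frames combined.

```python
import numpy as np

def shift_bev(feat, dx, dy):
    """Translate a (C, H, W) BEV map by integer cell offsets.

    Cells that move in from outside the grid are zero-filled.
    This is a crude stand-in for ego-motion compensation (assumption).
    """
    out = np.zeros_like(feat)
    _, h, w = feat.shape
    if abs(dx) >= w or abs(dy) >= h:
        return out  # shifted completely out of view
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[:, dst_y, dst_x] = feat[:, src_y, src_x]
    return out

def recurrent_fuse(state, current, ego_shift, alpha=0.5):
    """One recurrent step: align the carried state, then blend with the new frame.

    The blend is a fixed convex combination (assumption); a learned
    fusion module would replace it in practice.
    """
    aligned = shift_bev(state, *ego_shift)
    return alpha * current + (1.0 - alpha) * aligned

# Usage: only `state` is kept between frames, however long the sequence is.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((8, 16, 16)) for _ in range(5)]
state = frames[0]
for frame in frames[1:]:
    state = recurrent_fuse(state, frame, ego_shift=(1, 0), alpha=0.5)
```

With a fixed `alpha`, the contribution of a frame seen k steps ago decays as `(1 - alpha) ** k`, so older frames still influence the state without being stored.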
