Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

TL;DR
This paper improves LiDAR-camera 3D detection by using 2D object priors to pre-align cross-modal features before fusion, and reports state-of-the-art results on nuScenes and Argoverse 2.
Contribution
The paper proposes a framework that combines three modules (PGDC, DAGF, and SGDM) to pre-align and then fuse cross-modal features, addressing spatial misalignment between LiDAR and camera features in 3D perception.
Findings
Achieves SOTA mAP of 71.5% on nuScenes
Attains 73.6% NDS on the nuScenes validation set
Secures 41.7% mAP on Argoverse 2
Abstract
Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that the locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries, which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth…
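The observation that projection errors cluster at object-background boundaries can be illustrated with a small sketch. The code below is not the paper's method (the abstract does not describe the modules in enough detail to reproduce them). It only shows, under assumptions, how LiDAR points might be projected into an image with a pinhole model and flagged when they land near the edge of a 2D detection box, where a small calibration or timing error is most likely to assign a point to the wrong surface. The function names, the `margin` parameter, the intrinsics `K`, and the identity extrinsic are all hypothetical choices made for the demo.

```python
import numpy as np

def project_to_image(points_lidar, T_cam_from_lidar, K):
    """Project (N, 3) LiDAR points to pixels with a pinhole camera model.

    Returns pixel coordinates (NaN for points behind the camera),
    per-point depth in the camera frame, and a validity mask.
    """
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])   # (N, 4) homogeneous
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]          # (N, 3) camera frame
    depth = cam[:, 2]
    valid = depth > 1e-6                                # in front of the camera
    uv = np.full((n, 2), np.nan)
    pix = (K @ cam[valid].T).T                          # (M, 3) before division
    uv[valid] = pix[:, :2] / pix[:, 2:3]                # perspective division
    return uv, depth, valid

def near_box_boundary(uv, boxes, margin):
    """Flag pixels within `margin` px of the edge of any (x1, y1, x2, y2) box."""
    u, v = uv[:, 0], uv[:, 1]
    flags = np.zeros(uv.shape[0], dtype=bool)
    for x1, y1, x2, y2 in boxes:
        in_outer = ((u >= x1 - margin) & (u <= x2 + margin) &
                    (v >= y1 - margin) & (v <= y2 + margin))
        in_inner = ((u > x1 + margin) & (u < x2 - margin) &
                    (v > y1 + margin) & (v < y2 - margin))
        flags |= in_outer & ~in_inner                   # band around the edge
    return flags

# Demo values (assumed): LiDAR and camera frames coincide, toy intrinsics.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T_cam_from_lidar = np.eye(4)
points = np.array([[0.0, 0.0, 10.0],    # projects to the box centre
                   [0.95, 0.0, 10.0],   # projects close to the right edge
                   [3.0, 0.0, 10.0],    # projects well outside the box
                   [0.0, 0.0, -5.0]])   # behind the camera
boxes = [(40.0, 40.0, 60.0, 60.0)]      # one hypothetical 2D detection

uv, depth, valid = project_to_image(points, T_cam_from_lidar, K)
flags = near_box_boundary(uv, boxes, margin=2.0)
```

In this toy setup only the second point is flagged as boundary-adjacent. A real pipeline would take the extrinsics and intrinsics from sensor calibration and the boxes from a 2D detector, and what is then done with the flagged points is specific to the paper's modules rather than to this sketch.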
