Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Xiang Li; Zhangchi Hu; Xiao Xu; Bin Kong

arXiv:2507.16861·cs.CV·March 20, 2026

TL;DR

This paper improves LiDAR-camera 3D detection by using 2D object priors to pre-align the two modalities' features before fusing them. It reports state-of-the-art results on the nuScenes validation set and 41.7% mAP on Argoverse 2.

Contribution

The paper proposes a framework of three modules (PGDC, DAGF, and SGDM) that pre-align and then fuse cross-modal features, addressing the spatial misalignment between LiDAR and camera features in 3D perception.

Findings

1. Achieves SOTA mAP of 71.5% on nuScenes
2. Attains 73.6% NDS on nuScenes validation set
3. Secures 41.7% mAP on Argoverse 2

Abstract

Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, which stem from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that the locations of these projection errors are not random but highly predictable: they are concentrated at object-background boundaries, which 2D detectors can reliably identify. Based on this, our main motivation is to use 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth…
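The boundary observation in the abstract can be illustrated with a minimal sketch. This is not the paper's method. It only shows, assuming a simple pinhole camera model, how LiDAR points projected into the image could be flagged as having unreliable depth when they land near the edge of a 2D detection box. All names (`project_point`, `near_box_boundary`, `flag_unreliable_depths`) and the `margin` parameter are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]       # (x, y, z) in the camera frame, z forward
Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def project_point(p: Point3D, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[float, float, float]]:
    """Pinhole projection (assumed model); returns (u, v, depth) or None."""
    x, y, z = p
    if z <= 0.0:
        return None  # point is behind the camera
    return (fx * x / z + cx, fy * y / z + cy, z)


def near_box_boundary(u: float, v: float, box: Box2D, margin: float) -> bool:
    """True if pixel (u, v) lies within `margin` pixels of the box edge."""
    x0, y0, x1, y1 = box
    in_outer = (x0 - margin <= u <= x1 + margin) and (y0 - margin <= v <= y1 + margin)
    in_inner = (x0 + margin < u < x1 - margin) and (y0 + margin < v < y1 - margin)
    return in_outer and not in_inner


def flag_unreliable_depths(points: List[Point3D], boxes: List[Box2D],
                           fx: float, fy: float, cx: float, cy: float,
                           margin: float = 4.0) -> List[bool]:
    """Flag points whose projection falls in the boundary band of any 2D box."""
    flags = []
    for p in points:
        proj = project_point(p, fx, fy, cx, cy)
        if proj is None:
            flags.append(False)
            continue
        u, v, _ = proj
        flags.append(any(near_box_boundary(u, v, b, margin) for b in boxes))
    return flags


if __name__ == "__main__":
    boxes = [(40.0, 40.0, 60.0, 60.0)]  # one hypothetical 2D detection
    points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (5.0, 0.0, 10.0)]
    print(flag_unreliable_depths(points, boxes, 100.0, 100.0, 50.0, 50.0))
```

In this toy setup the first point projects to the box center, the second onto the box edge, and the third outside the box, so only the second is flagged. A real pipeline would also apply the LiDAR-to-camera extrinsics before projection, which this sketch leaves out.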