AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection
Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang, Feng Zhao

TL;DR
AutoAlignV2 is a faster multi-modal 3D object detection framework that fuses point clouds and RGB images through sparse deformable feature aggregation, reaching 72.4 NDS on the nuScenes test leaderboard.
Contribution
It proposes a novel Cross-Domain DeformCAFA module for cross-modal feature aggregation and a dynamic inference scheme, significantly improving speed and accuracy over previous methods.
Findings
Achieves 72.4 NDS on the nuScenes test leaderboard.
Outperforms previous multi-modal 3D detectors in accuracy.
Demonstrates improved efficiency and robustness in multi-modal fusion.
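For readers unfamiliar with the metric, NDS (nuScenes Detection Score) combines mean average precision with five true-positive error terms (translation, scale, orientation, velocity, attribute), each clipped to 1 and turned into a score. The sketch below follows the public nuScenes definition; the function name `nds` and its argument names are illustrative, and it is not taken from the AutoAlignV2 code.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score on a 0-1 scale.

    mean_ap:   mAP in [0, 1]
    tp_errors: the five mean true-positive errors
               (mATE, mASE, mAOE, mAVE, mAAE), each >= 0
    """
    if len(tp_errors) != 5:
        raise ValueError("NDS expects exactly five true-positive error terms")
    # Each error is clipped to 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5, each TP score weight 1, normalised by 10.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0
```

The leaderboard reports this value multiplied by 100, so a result of 72.4 NDS corresponds to 0.724 on this scale.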
Abstract
Point clouds and RGB images are two common perception sources in autonomous driving. The former provide accurate object localization, while the latter are denser and richer in semantic information. Recently, AutoAlign presented a learnable paradigm for combining these two modalities for 3D object detection. However, it suffers from the high computational cost of its global attention. To solve this problem, we propose the Cross-Domain DeformCAFA module. It attends to sparse learnable sampling points for cross-modal relational modeling, which improves tolerance to calibration error and greatly speeds up feature aggregation across modalities. To overcome the complexity of GT-AUG in multi-modal settings, we design a simple yet effective cross-modal augmentation strategy based on a convex combination of image patches given their depth information. Moreover,…
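The idea of attending to a few sparse sampling points, rather than to every image location, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' DeformCAFA implementation: the feature map has a single channel, and the sampling offsets and attention logits are passed in directly, whereas in a learned module they would be predicted from the query features. All function names here are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D map feat[row][col] at a continuous position (x, y).

    Neighbours that fall outside the map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(feat, ref_xy, offsets, logits):
    """Weighted sum of K samples taken around one reference point.

    ref_xy:  (x, y) image location, e.g. where a 3D point projects
    offsets: K (dx, dy) pairs (assumed given here; learned in practice)
    logits:  K unnormalised attention scores (likewise assumed given)
    """
    weights = softmax(logits)
    rx, ry = ref_xy
    return sum(
        w * bilinear_sample(feat, rx + dx, ry + dy)
        for w, (dx, dy) in zip(weights, offsets)
    )


# Usage: two sampling points with equal attention on a 2x2 map.
feature_map = [[0.0, 1.0], [2.0, 3.0]]
print(deformable_aggregate(feature_map, (0.0, 0.0), [(0, 0), (1, 1)], [0.0, 0.0]))  # 1.5
```

The cost per query grows with the number of sampling points K instead of the image size, which is the source of the speed-up over global attention. Because the offsets can move the samples away from the projected location, a small calibration error in `ref_xy` can in principle be compensated.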
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection
Methods: Test · Dropout
